Test Runs
A test run on Confident AI is a collection of evaluated test cases and their corresponding metric scores. There are two ways to produce a test run on Confident AI:
- running an experiment directly on Confident AI's infrastructure by sending test cases to Confident AI
- evaluating test cases locally on
deepeval
and sending it to Confident AI after evaluation
Running A Test Run on Confident AI
You can evaluate your LLM application on Confident AI's infrastructure by first creating an experiment on Confident AI and define metrics of your choice before sending over test cases (with actual_output
s populated by your LLM application) for evaluation.
from deepeval import confident_evaluate
from deepeval.test_case import LLMTestCase
confident_evaluate(
experiment_name="My First Experiment",
test_cases=[LLMTestCase(...)]
)
You should refer to the experiments section if you're unsure what an experiment is on Confident AI.
There are two mandatory and one optional arguments when calling the confident_evaluate()
function:
experiment_name
: a string that specifies the name of the experiment you wish to evaluate your test cases against on Confident AI.test_cases
: a list ofLLMTestCase
s/ConversationalTestCase
s OR anEvaluationDataset
. Confident AI will evaluate your LLM application using the metrics you defined for this particular experiment for all test cases.disable_browser_opening
: a boolean which when set toTrue
, will disable the auto-opening of the browser, which brings you to the experiment page ofexperiment_name
.
Running A Test Run via deepeval
We're going to show a full example which includes the datasets feature from Confident AI in this example. You can similarly apply the confident_evaluate()
function to the examples below to run evaluations on Confident AI instead.
Using the evaluate()
Function
You can use deepeval
's evaluate()
function to run a test run. All test run results will be automatically sent to Confident AI if you're logged in.
First, pull the dataset:
from deepeval.dataset import EvaluationDataset
dataset = EvaluationDataset()
dataset.pull(
alias="Your Dataset Alias",
# Don't convert goldens to test cases yet
auto_convert_goldens_to_test_cases=False
)
Since we're going to be generating actual_output
s at evaluation time, we're NOT going to convert the pulled goldens to test cases yet.
If you're unfamilar with what we're doing with an EvaluationDataset
, please first visit the datasets section.
With the dataset pulled, we can now generate the actual_output
s for each golden and populate the evaluation dataset with test cases that are ready for evaluation (we're using a hypothetical llm_app
to generate actual_output
s in this example):
# A hypothetical LLM application example
import llm_app
from typing import List
from deepeval.test_case import LLMTestCase
from deepeval.dataset import Golden
...
def convert_goldens_to_test_cases(goldens: List[Golden]) -> List[LLMTestCase]:
test_cases = []
for golden in goldens:
test_case = LLMTestCase(
input=golden.input,
# Generate actual output using the 'input'.
# Replace this with your actual LLM application
actual_output=llm_app.generate(golden.input),
expected_output=golden.expected_output,
context=golden.context,
retrieval_context=golden.retrieval_context
)
test_cases.append(test_case)
return test_cases
# Data preprocessing before setting the dataset test cases
dataset.test_cases = convert_goldens_to_test_cases(dataset.goldens)
Now that your dataset is ready, define your metrics. Here, we'll be using the AnswerRelevancyMetric
to keep things simple:
from deepeval.metrics import AnswerRelevancyMetric
metric = AnswerRelevancyMetric()
Lastly, using the metrics you've defined and dataset you've pulled and generated actual_output
s for, execute your test run:
import evaluate
...
evaluate(dataset, metrics=[metric])
Using the deepeval test run
Command
Alternatively, you can also execute a test run via deepeval
's Pytest integration, which similar to the evaluate()
function, will automatically send all test run results to Confident AI. Here is a full example that is analogous to the one above:
# A hypothetical LLM application example
import llm_app
from typing import List
from deepeval.test_case import LLMTestCase
from deepeval.dataset import EvaluationDataset, Golden
from deepeval import assert_test
dataset = EvaluationDataset()
dataset.pull(
alias="Your Dataset Alias",
# Don't convert goldens to test cases yet
auto_convert_goldens_to_test_cases=False
)
def convert_goldens_to_test_cases(goldens: List[Golden]) -> List[LLMTestCase]:
test_cases = []
for golden in goldens:
test_case = LLMTestCase(
input=golden.input,
# Generate actual output using the 'input'.
# Replace this with your actual LLM application
actual_output=llm_app.generate(golden.input),
expected_output=golden.expected_output,
context=golden.context,
retrieval_context=golden.retrieval_context
)
test_cases.append(test_case)
return test_cases
# Data preprocessing before setting the dataset test cases
dataset.test_cases = convert_goldens_to_test_cases(dataset.goldens)
@pytest.mark.parametrize(
"test_case",
dataset,
)
def test_llm_app(test_case: LLMTestCase):
metric = AnswerRelevancyMetric()
assert_test(test_case, [metric])
And execute your test file via the CLI:
deepeval test run test_llm_app.py
Some users prefer deepeval test run
over evaluate()
is because it integrates well with CI/CD pipelines such as Github Actions. Another reason why deepeval test run
is prefered is because you can easily spin up multiple processes to evaluate multiple test cases in parallel:
deepeval test run test_llm_app.py -n 10
You can learn more about all the functionalities deepeval test run
offers here.
In CI/CD Pipelines
You can also execute run a test run in CI/CD pipelines by executing deepeval test run
to regression test your LLM application and have the results sent to Confident AI. Here is an example of how you would do it in GitHub Actions:
name: LLM Regression Test
on:
push:
pull_request:
jobs:
test:
runs-on: ubuntu-latest
steps:
- name: Checkout Code
uses: actions/checkout@v2
- name: Set up Python
uses: actions/setup-python@v4
with:
python-version: "3.10"
- name: Install Poetry
run: |
curl -sSL https://install.python-poetry.org | python3 -
echo "$HOME/.local/bin" >> $GITHUB_PATH
- name: Install Dependencies
run: poetry install --no-root
- name: Login to Confident AI
env:
CONFIDENT_API_KEY: ${{ secrets.CONFIDENT_API_KEY }}
run: poetry run deepeval login --confident-api-key "$CONFIDENT_API_KEY"
- name: Run DeepEval Test Run
run: poetry run deepeval test run test_llm_app.py
You don't have to necessarily use poetry
for installation or follow each step exactly as presented. We're merely showing an example of how a sample yaml
file to execute a deepeval test run
would look like.
Test Runs On Confident AI
All test runs created, on Confident AI or locally via deepeval
(ie. evaluate()
or deepeval test run
), are automatically sent to Confident AI for anyone in your organization to view, share, download, and comment on.
Confident AI also keeps track of additional metadata such as time taken for a test run to finish executing, and the evaluation cost associated with this test run.
Viewing Test Cases
Download Test Cases
Comment On Test Runs
Iterating on Hyperparameters
This section is currently only available if you're running evaluations locally via deepeval
.
In the first section, we saw how to execute a test run by dynamically generating actual_output
for pulled datasets at evaluation time. To experiemnt with different hyperparameter values to iterate towards the optimal LLM to use for your LLM application for example, you should associate hyperparameter values to test runs to compare and pick the best hyperparameters on Confident AI.
Continuing from the previous example, if for example llm_app
uses gpt-4o
as its LLM, you can associate the gpt-4o
value as such:
...
evaluate(
test_cases=[test_case],
metrics=[metric],
hyperparameters={"model": "gpt4o", "prompt template": "..."}
)
You can run a heavily nested for loop to generate actual_output
s to get test run evaluation results for all hyperparameter combinations.
for model in models:
for prompt in prompts:
evaluate(..., hyperparameters={...})
Or, if you're using deepeval test run
:
...
# You should aim to make these values dynamic
@deepeval.log_hyperparameters(model="gpt-4o", prompt_template="...")
def hyperparameters():
# Return a custom flat dict to log any additional hyperparameters.
return {}
This will help Confident AI identify which hyperparameter combination were used to generate each test run results, which you can ultimately easily filter for and visualize experiments on the platform.