Datasets
Quick Summary
In `deepeval`, an evaluation dataset, or just dataset, is a collection of `LLMTestCase`s. There are two approaches to evaluating datasets in `deepeval`:
- using `@pytest.mark.parametrize` and `assert_test`
- using `evaluate`
Create An Evaluation Dataset
An `EvaluationDataset` in `deepeval` is simply a collection of `LLMTestCase`s and/or `Golden`s.
A `Golden` is very similar to an `LLMTestCase`, but it is more flexible because it does not require an `actual_output` at initialization. On the flip side, whilst test cases are always ready for evaluation, a golden isn't.
With Test Cases
from deepeval.test_case import LLMTestCase
from deepeval.dataset import EvaluationDataset
first_test_case = LLMTestCase(input="...", actual_output="...")
second_test_case = LLMTestCase(input="...", actual_output="...")
test_cases = [first_test_case, second_test_case]
dataset = EvaluationDataset(
    # alias is optional, but helps you
    # identify your dataset on Confident AI
    alias="My first dataset",
    test_cases=test_cases
)
You can also append a test case to an `EvaluationDataset` through the `test_cases` instance variable:
...
dataset.test_cases.append(test_case)
# or
dataset.add_test_case(test_case)
With Goldens
You should opt to initialize `EvaluationDataset`s with goldens if you're looking to generate LLM outputs at evaluation time. This usually means your original dataset does not contain precomputed outputs, but only the inputs you want to evaluate your LLM (application) on.
from deepeval.dataset import EvaluationDataset, Golden
first_golden = Golden(input="...")
second_golden = Golden(input="...")
goldens = [first_golden, second_golden]
dataset = EvaluationDataset(
    # alias is optional, but helps you
    # identify your dataset on Confident AI
    alias="My first dataset",
    goldens=goldens
)
A `Golden` and an `LLMTestCase` have almost identical class signatures, so technically you can also supply other parameters such as the `actual_output` when creating a `Golden`.
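For example, here is a minimal sketch of a golden that carries more than just an input. Note that only `actual_output` is explicitly mentioned above; `expected_output` and `context` are assumed to carry over from `LLMTestCase`'s near-identical signature.
from deepeval.dataset import Golden
# Sketch only: actual_output is documented as supported, while
# expected_output and context are assumed from the shared signature.
golden_with_output = Golden(
    input="What is the chemical formula for water?",
    actual_output="The chemical formula for water is H2O.",
    expected_output="H2O",
    context=["The chemical formula for water is H2O."]
)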
Generate An Evaluation Dataset
You can generate `EvaluationDataset`s using `deepeval`'s `Synthesizer` class. The `Synthesizer` class is a synthetic data generator that first uses an LLM to generate a series of `input`s based on a list of provided `context`s, before evolving each `input` to make it more complex and realistic. These evolved `input`s are then used to create a list of goldens, which will form your `EvaluationDataset`.
from deepeval.synthesizer import Synthesizer
from deepeval.dataset import EvaluationDataset
synthesizer = Synthesizer()
contexts = [
    ["The Earth revolves around the Sun.", "Planets are celestial bodies."],
    ["Water freezes at 0 degrees Celsius.", "The chemical formula for water is H2O."],
]

# Use synthesizer directly
synthesizer.generate_goldens(contexts=contexts)
synthesizer.save_as(
    # also accepts 'csv'
    file_type='json',
    path="./synthetic_data"
)

# Use synthesizer within an EvaluationDataset
dataset = EvaluationDataset()
dataset.generate_goldens(
    synthesizer=synthesizer,
    contexts=contexts
)
dataset.save_as(
    # also accepts 'csv'
    file_type='json',
    path="./synthetic_data"
)
There are two optional parameters when creating a `Synthesizer`:
- [Optional] `model`: a string specifying which of OpenAI's GPT models to use, OR any custom LLM model of type `DeepEvalBaseLLM`. Defaulted to 'gpt-4-0125-preview'.
- [Optional] `multithreading`: a boolean which, when set to `True`, enables concurrent generation of goldens. Defaulted to `True`.
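For instance, here is a minimal sketch of a `Synthesizer` configured with both of the parameters listed above (the model string shown is the documented default, and `multithreading=False` disables concurrent generation):
from deepeval.synthesizer import Synthesizer
# Both keyword arguments correspond to the optional parameters above
synthesizer = Synthesizer(
    model="gpt-4-0125-preview",
    multithreading=False
)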
We highly recommend calling `save_as()` to save all generated synthetic data.
Load an Existing Dataset
`deepeval` offers support for loading datasets stored in JSON files, CSV files, and Hugging Face datasets into an `EvaluationDataset` as test cases.
From JSON
You can add test cases into your `EvaluationDataset` by supplying a `file_path` to your `.json` file. Your `.json` file should contain an array of objects (or list of dictionaries).
from deepeval.dataset import EvaluationDataset
dataset = EvaluationDataset()
dataset.add_test_cases_from_json_file(
    # file_path is the absolute path to your .json file
    file_path="example.json",
    input_key_name="query",
    actual_output_key_name="actual_output",
    expected_output_key_name="expected_output",
    context_key_name="context",
)
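For illustration, the snippet below writes a hypothetical example.json matching the key names used above; the records are made up, and only the array-of-objects structure matters:
import json
# Hypothetical records matching the key names passed to
# add_test_cases_from_json_file; all values are illustrative.
records = [
    {
        "query": "What does the Earth revolve around?",
        "actual_output": "The Earth revolves around the Sun.",
        "expected_output": "The Sun.",
        "context": ["The Earth revolves around the Sun."],
    }
]
with open("example.json", "w") as f:
    json.dump(records, f)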
From CSV
You can add test cases into your `EvaluationDataset` by supplying a `file_path` to your `.csv` file. Your `.csv` file should contain rows that can be mapped into `LLMTestCase`s through their column names. Remember, `context` should be a list of strings, so for CSV files you also have to supply a `context_col_delimiter` argument to tell `deepeval` how to split your context cells into a list of strings.
from deepeval.dataset import EvaluationDataset
dataset = EvaluationDataset()
dataset.add_test_cases_from_csv_file(
    # file_path is the absolute path to your .csv file
    file_path="example.csv",
    input_col_name="query",
    actual_output_col_name="actual_output",
    expected_output_col_name="expected_output",
    context_col_name="context",
    context_col_delimiter=";"
)
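As a sketch of what such a file could look like, the snippet below writes a hypothetical example.csv whose context cells join multiple strings with the ';' delimiter used above (all values are illustrative):
import csv
# Hypothetical rows matching the column names passed to
# add_test_cases_from_csv_file; note how the context cell packs a
# list of strings into a single field using ';'.
rows = [
    {
        "query": "At what temperature does water freeze?",
        "actual_output": "Water freezes at 0 degrees Celsius.",
        "expected_output": "0 degrees Celsius.",
        "context": "Water freezes at 0 degrees Celsius.;The chemical formula for water is H2O.",
    }
]
with open("example.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["query", "actual_output", "expected_output", "context"])
    writer.writeheader()
    writer.writerows(rows)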
From Hugging Face
from deepeval.dataset import EvaluationDataset
dataset = EvaluationDataset()
dataset.add_test_cases_from_hf_dataset(
    dataset_name="Example HF dataset name",
    input_field_name="query",
    actual_output_field_name="actual_output",
    expected_output_field_name="expected_output",
    context_field_name="context",
    split="train",
)
Since `expected_output` and `context` are optional parameters for an `LLMTestCase`, the expected output and context fields are similarly optional when adding test cases from an existing dataset.
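For example, assuming the optional key name arguments can simply be left out (a sketch, not the definitive signature), loading a JSON file that only contains inputs and actual outputs might look like this:
from deepeval.dataset import EvaluationDataset
dataset = EvaluationDataset()
# Sketch: only the input and actual output keys are mapped here,
# since the expected output and context fields are optional.
dataset.add_test_cases_from_json_file(
    file_path="example.json",
    input_key_name="query",
    actual_output_key_name="actual_output",
)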
Evaluate Your Dataset With Pytest
Before we begin, we highly recommend logging into Confident AI to keep track of all evaluation results on the cloud:
deepeval login
`deepeval` utilizes the `@pytest.mark.parametrize` decorator to loop through entire datasets.
import pytest
import deepeval
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import HallucinationMetric, AnswerRelevancyMetric
from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset(test_cases=[...])

@pytest.mark.parametrize(
    "test_case",
    dataset,
)
def test_customer_chatbot(test_case: LLMTestCase):
    hallucination_metric = HallucinationMetric(threshold=0.3)
    answer_relevancy_metric = AnswerRelevancyMetric(threshold=0.5)
    assert_test(test_case, [hallucination_metric, answer_relevancy_metric])

@deepeval.on_test_run_end
def function_to_be_called_after_test_run():
    print("Test finished!")
Iterating through a `dataset` object implicitly loops through the test cases in the dataset. To iterate through goldens, access `dataset.goldens` instead.
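For instance, here is a sketch of parametrizing over goldens instead, reusing the imports and dataset from the example above; `your_llm_app` is a hypothetical stand-in for the application that produces the `actual_output` at evaluation time:
@pytest.mark.parametrize(
    "golden",
    dataset.goldens,
)
def test_from_goldens(golden):
    # your_llm_app is hypothetical; replace it with your own LLM application
    test_case = LLMTestCase(
        input=golden.input,
        actual_output=your_llm_app(golden.input),
    )
    assert_test(test_case, [AnswerRelevancyMetric(threshold=0.5)])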
To run several test cases at once in parallel, use the optional `-n` flag followed by a number (which determines the number of processes that will be used) when executing `deepeval test run`:
deepeval test run test_bulk.py -n 3
Evaluate Your Dataset Without Pytest
Alternatively, you can use `deepeval`'s `evaluate` function to evaluate datasets. This approach avoids the CLI, but does not allow for parallel test execution.
from deepeval import evaluate
from deepeval.metrics import HallucinationMetric, AnswerRelevancyMetric
from deepeval.dataset import EvaluationDataset
dataset = EvaluationDataset(test_cases=[...])
hallucination_metric = HallucinationMetric(threshold=0.3)
answer_relevancy_metric = AnswerRelevancyMetric(threshold=0.5)
dataset.evaluate([hallucination_metric, answer_relevancy_metric])
# You can also call the evaluate() function directly
evaluate(dataset, [hallucination_metric, answer_relevancy_metric])