Metrics
Quick Summary
In deepeval, a metric serves as a standard of measurement for evaluating the performance of an LLM output based on a specific criterion of interest. Essentially, while the metric acts as the ruler, a test case represents the thing you're trying to measure. deepeval offers a range of default metrics for you to quickly get started with, such as:
- G-Eval
- Summarization
- Faithfulness
- Answer Relevancy
- Contextual Relevancy
- Contextual Precision
- Contextual Recall
- Ragas
- Hallucination
- Toxicity
- Bias
deepeval also offers you a straightforward way to develop your own custom evaluation metrics. All metrics are measured on a test case. Visit the test cases section to learn how to apply any metric on test cases for evaluation.
Types of Metrics
A custom metric is a type of metric you can easily create by implementing abstract methods and properties of base classes provided by deepeval. They are extremely versatile and seamlessly integrate with Confident AI without requiring any additional setup. As you'll see later, a custom metric can either be an LLM-Eval (LLM evaluated) or a classic metric. A classic metric is a type of metric whose criteria isn't evaluated using an LLM.
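For example, here is a minimal sketch of a classic custom metric. It assumes the BaseMetric base class exposed by deepeval.metrics and a simple, hypothetical length-based criterion; the exact abstract methods and properties may differ slightly across deepeval versions:
from deepeval.metrics import BaseMetric
from deepeval.test_case import LLMTestCase

class LengthMetric(BaseMetric):
    # A classic metric: no LLM is used to compute the score
    def __init__(self, threshold: float = 0.5):
        self.threshold = threshold

    def measure(self, test_case: LLMTestCase) -> float:
        # Hypothetical criterion: reward outputs of at least 100 characters
        self.score = min(len(test_case.actual_output) / 100, 1.0)
        self.success = self.score >= self.threshold
        return self.score

    def is_successful(self) -> bool:
        return self.success

    @property
    def __name__(self):
        return "Length"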
deepeval also offers default metrics. Most default metrics offered by deepeval are LLM-Evals, which means they are evaluated using LLMs. This is deliberate because LLM-Evals are versatile in nature and align better with human expectations compared to traditional model-based approaches.
deepeval's LLM-Evals are a step up from other implementations because they:
- are extra reliable, as LLMs are only used for extremely specific tasks during evaluation to greatly reduce stochasticity and flakiness in scores.
- provide a comprehensive reason for the scores computed.
All of deepeval's default metrics output a score between 0 and 1, and require a threshold argument to instantiate. A default metric is only successful if the evaluation score is equal to or greater than threshold.
All GPT models from OpenAI are available for LLM-Evals (metrics that use LLMs for evaluation). You can switch between models by providing a string corresponding to OpenAI's model names via the optional model argument when instantiating an LLM-Eval.
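For instance, the snippet below is a sketch that sets both a passing threshold and a specific OpenAI model (the "gpt-4" string is just an example of an OpenAI model name):
from deepeval.metrics import AnswerRelevancyMetric

# Passes only if the computed score is >= 0.7, evaluated using "gpt-4"
metric = AnswerRelevancyMetric(threshold=0.7, model="gpt-4")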
Using OpenAI
To use OpenAI for deepeval's LLM-Evals (metrics evaluated using an LLM), supply your OPENAI_API_KEY in the CLI:
export OPENAI_API_KEY=<your-openai-api-key>
Alternatively, if you're working in a notebook environment (Jupyter or Colab), set your OPENAI_API_KEY in a cell:
%env OPENAI_API_KEY=<your-openai-api-key>
Please do not include quotation marks when setting your OPENAI_API_KEY if you're working in a notebook environment.
Using Azure OpenAI
deepeval also allows you to use Azure OpenAI for metrics that are evaluated using an LLM. Run the following command in the CLI to configure your deepeval environment to use Azure OpenAI for all LLM-based metrics.
deepeval set-azure-openai --openai-endpoint=<endpoint> \
--openai-api-key=<api_key> \
--deployment-name=<deployment_name> \
--openai-api-version=<openai_api_version> \
--model-version=<model_version>
Note that the model-version is optional. If you ever wish to stop using Azure OpenAI and move back to regular OpenAI, simply run:
deepeval unset-azure-openai
Using a Custom LLM
deepeval allows you to use ANY custom LLM for evaluation. This includes LLMs from langchain's chat_model module, Hugging Face's transformers library, or even LLMs in GGML format.
We CANNOT guarantee that evaluations will work as expected when using a custom model. This is because evaluation requires high levels of reasoning and the ability to follow instructions, such as outputting responses in JSON format, which are generally hard to achieve with custom LLMs.
Azure OpenAI Example
Here is an example of creating a custom Azure OpenAI model through langchain's AzureChatOpenAI module for evaluation:
from langchain_openai import AzureChatOpenAI
from deepeval.models.base_model import DeepEvalBaseLLM

class AzureOpenAI(DeepEvalBaseLLM):
    def __init__(
        self,
        model
    ):
        self.model = model

    def load_model(self):
        return self.model

    def _call(self, prompt: str) -> str:
        # Invoke the underlying langchain chat model and return its text content
        chat_model = self.load_model()
        return chat_model.invoke(prompt).content

    def get_model_name(self):
        return "Custom Azure OpenAI Model"

# Replace these with real values
custom_model = AzureChatOpenAI(
    openai_api_version=openai_api_version,
    azure_deployment=azure_deployment,
    azure_endpoint=azure_endpoint,
    openai_api_key=openai_api_key,
)

azure_openai = AzureOpenAI(model=custom_model)
print(azure_openai("Write me a joke"))
When creating a custom LLM evaluation model you should ALWAYS:
- inherit DeepEvalBaseLLM.
- implement the load_model() method, which will be responsible for returning a model object.
- implement the _call() method with one and only one parameter of type string that acts as the prompt to your custom LLM.
- return the final output string of your custom LLM from the _call() method. Note that we called chat_model.invoke(prompt).content to access the model output in this particular example, but this could be different depending on the implementation of your custom LLM object.
- implement the get_model_name() method, which simply returns a string representing the name of your LLM model.
Note that the model argument in the __init__() method can accept any type (the model string or object itself). Lastly, to use it for evaluation in LLM-based metrics:
from deepeval.metrics import AnswerRelevancyMetric
...
metric = AnswerRelevancyMetric(model=azure_openai)
While the Azure OpenAI command configures deepeval to use Azure OpenAI globally for all LLM-Evals, a custom LLM has to be set each time you instantiate a metric. Remember to provide your custom LLM instance through the model parameter for each metric you wish to use it for, as shown below.
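For example, if you want the same custom LLM for several LLM-Evals, pass it to each metric you instantiate (FaithfulnessMetric is shown here simply as another LLM-based metric from the list above):
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric

# The custom LLM must be supplied to every metric instance
answer_relevancy = AnswerRelevancyMetric(threshold=0.5, model=azure_openai)
faithfulness = FaithfulnessMetric(threshold=0.5, model=azure_openai)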
Mistral 7B Example
Here is an example of creating a custom Mistral 7B model through Hugging Face's transformers library for evaluation:
from transformers import AutoModelForCausalLM, AutoTokenizer
from deepeval.models.base_model import DeepEvalBaseLLM

class Mistral7B(DeepEvalBaseLLM):
    def __init__(
        self,
        model,
        tokenizer
    ):
        self.model = model
        self.tokenizer = tokenizer

    def load_model(self):
        return self.model

    def _call(self, prompt: str) -> str:
        model = self.load_model()

        device = "cuda"  # the device to load the model onto

        model_inputs = self.tokenizer([prompt], return_tensors="pt").to(device)
        model.to(device)

        generated_ids = model.generate(**model_inputs, max_new_tokens=100, do_sample=True)
        return self.tokenizer.batch_decode(generated_ids)[0]

    def get_model_name(self):
        return "Mistral 7B"

model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

mistral_7b = Mistral7B(model=model, tokenizer=tokenizer)
print(mistral_7b("Write me a joke"))
Note that for this particular implementation, we initialized our Mistral7B model with an additional tokenizer parameter, as this is required in the decoding step of the _call() method, unlike the AzureOpenAI example above. Lastly, to use it for evaluation in LLM-based metrics:
from deepeval.metrics import AnswerRelevancyMetric
...
metric = AnswerRelevancyMetric(model=mistral_7b)
You need to specify the custom evaluation model you would like to use through the model argument when instantiating an LLM-based metric.
AWS Bedrock Example
Here is an example of creating a custom AWS Bedrock model through the langchain_community.chat_models module for evaluation:
from langchain_community.chat_models import BedrockChat
from deepeval.models.base_model import DeepEvalBaseLLM

class AWSBedrock(DeepEvalBaseLLM):
    def __init__(
        self,
        model
    ):
        self.model = model

    def load_model(self):
        return self.model

    def _call(self, prompt: str) -> str:
        chat_model = self.load_model()
        return chat_model.invoke(prompt).content

    def get_model_name(self):
        return "Custom AWS Bedrock Model"

# Replace these with real values
custom_model = BedrockChat(
    credentials_profile_name=<your-profile-name>,  # e.g. "default"
    region_name=<your-region-name>,  # e.g. "us-east-1"
    endpoint_url=<your-bedrock-endpoint>,  # e.g. "https://bedrock-runtime.us-east-1.amazonaws.com"
    model_id=<your-model-id>,  # e.g. "anthropic.claude-v2"
    model_kwargs={"temperature": 0.4},
)

aws_bedrock = AWSBedrock(model=custom_model)
print(aws_bedrock("Write me a joke"))
Finally, supply the newly created aws_bedrock model to LLM-Evals:
from deepeval.metrics import AnswerRelevancyMetric
...
metric = AnswerRelevancyMetric(model=aws_bedrock)
Measuring a Metric
All metrics in deepeval, including custom metrics that you create:
- can be executed via the metric.measure() method
- can have their score accessed via metric.score
- can have their status accessed via metric.is_successful()
- can be used to evaluate test cases or entire datasets, with or without Pytest
- have a threshold that acts as the threshold for success. metric.is_successful() is only true if metric.score >= threshold.
In addition, most LLM-Evals in deepeval offer a reason for their score, which can be accessed via metric.reason.
Here's a quick example.
export OPENAI_API_KEY=<your-openai-api-key>
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase
# Initialize a test case
test_case = LLMTestCase(
    input="...",
    actual_output="...",
    retrieval_context=["..."]
)
# Initialize metric with threshold
metric = AnswerRelevancyMetric(threshold=0.5)
Using this metric, you can either execute it directly as a standalone evaluation to get its score and reason:
...
metric.measure(test_case)
print(metric.score)
print(metric.reason)
Or you can evaluate a test case using deepeval test run:
from deepeval import assert_test
...
def test_answer_relevancy():
    assert_test(test_case, [metric])
deepeval test run test_file.py
Or using the evaluate
function:
from deepeval import evaluate
...
evaluate([test_case], [metric])
For more details on how a metric evaluates a test case, refer to the test cases section.