Metrics
Quick Summary
In deepeval, a metric serves as a standard of measurement for evaluating the performance of an LLM output based on a specific criterion of interest. Essentially, while the metric acts as the ruler, a test case represents the thing you're trying to measure. deepeval offers a range of default metrics for you to quickly get started with, such as:
- G-Eval
- Summarization
- Faithfulness
- Answer Relevancy
- Contextual Relevancy
- Contextual Precision
- Contextual Recall
- Ragas
- Hallucination
- Toxicity
- Bias
deepeval also offers you a straightforward way to develop your own custom evaluation metrics. All metrics are measured on a test case. Visit the test cases section to learn how to apply any metric on test cases for evaluation.
Types of Metrics
A custom metric is a type of metric you can easily create by implementing abstract methods and properties of base classes provided by deepeval. They are extremely versatile and seamlessly integrate with Confident AI without requiring any additional setup. As you'll see later, a custom metric can either be an LLM-Eval (LLM evaluated) or a classic metric. A classic metric is a type of metric whose criteria isn't evaluated using an LLM.
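For example, here is a minimal sketch of a classic custom metric. It assumes the BaseMetric base class exposed by deepeval.metrics and a simple, hypothetical length-based criterion; the exact abstract methods and properties may differ slightly across deepeval versions:
from deepeval.metrics import BaseMetric
from deepeval.test_case import LLMTestCase

class LengthMetric(BaseMetric):
    # A classic metric: no LLM is used to compute the score
    def __init__(self, threshold: float = 0.5):
        self.threshold = threshold

    def measure(self, test_case: LLMTestCase) -> float:
        # Hypothetical criterion: reward outputs of at least 100 characters
        self.score = min(len(test_case.actual_output) / 100, 1.0)
        self.success = self.score >= self.threshold
        return self.score

    def is_successful(self) -> bool:
        return self.success

    @property
    def __name__(self):
        return "Length"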
deepeval also offers default metrics. Most default metrics offered by deepeval are LLM-Evals, which means they are evaluated using LLMs. This is deliberate because LLM-Evals are versatile in nature and align better with human expectations compared to traditional model-based approaches.
deepeval's LLM-Evals are a step up from other implementations because they:
- are extra reliable, as LLMs are only used for extremely specific tasks during evaluation to greatly reduce stochasticity and flakiness in scores.
- provide a comprehensive reason for the scores computed.
All of deepeval's default metrics output a score between 0 and 1, and require a threshold argument to instantiate. A default metric is only successful if the evaluation score is equal to or greater than threshold.
All GPT models from OpenAI are available for LLM-Evals (metrics that use LLMs for evaluation). You can switch between models by providing a string corresponding to OpenAI's model names via the optional model argument when instantiating an LLM-Eval.
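For instance, the snippet below is a sketch that sets both a passing threshold and a specific OpenAI model (the "gpt-4" string is just an example of an OpenAI model name):
from deepeval.metrics import AnswerRelevancyMetric

# Passes only if the computed score is >= 0.7, evaluated using "gpt-4"
metric = AnswerRelevancyMetric(threshold=0.7, model="gpt-4")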
Using OpenAI
To use OpenAI for deepeval's LLM-Evals (metrics evaluated using an LLM), supply your OPENAI_API_KEY in the CLI:
export OPENAI_API_KEY=<your-openai-api-key>
Alternatively, if you're working in a notebook environment (Jupyter or Colab), set your OPENAI_API_KEY in a cell:
%env OPENAI_API_KEY=<your-openai-api-key>
Please do not include quotation marks when setting your OPENAI_API_KEY if you're working in a notebook environment.
Using Azure OpenAI
deepeval also allows you to use Azure OpenAI for metrics that are evaluated using an LLM. Run the following command in the CLI to configure your deepeval environment to use Azure OpenAI for all LLM-based metrics.
deepeval set-azure-openai --openai-endpoint=<endpoint> \
--openai-api-key=<api_key> \
--deployment-name=<deployment_name> \
--openai-api-version=<openai_api_version> \
--model-version=<model_version>
Note that the model-version is optional. If you ever wish to stop using Azure OpenAI and move back to regular OpenAI, simply run:
deepeval unset-azure-openai
Using a Custom LLM
deepeval allows you to use ANY custom LLM for evaluation. This includes LLMs from langchain's chat_model module, Hugging Face's transformers library, or even LLMs in GGML format.
We CANNOT guarantee that evaluations will work as expected when using a custom model. This is because evaluation requires high levels of reasoning and the ability to follow instructions, such as outputting responses in JSON format, which are generally hard to achieve with custom LLMs.
Azure OpenAI Example
Here is an example of creating a custom Azure OpenAI model through langchain's AzureChatOpenAI module for evaluation:
from langchain_openai import AzureChatOpenAI
from deepeval.models.base_model import DeepEvalBaseLLM

class AzureOpenAI(DeepEvalBaseLLM):
    def __init__(
        self,
        model
    ):
        self.model = model

    def load_model(self):
        return self.model

    def _call(self, prompt: str) -> str:
        # Invoke the underlying langchain chat model and return its text content
        chat_model = self.load_model()
        return chat_model.invoke(prompt).content

    def get_model_name(self):
        return "Custom Azure OpenAI Model"

# Replace these with real values
custom_model = AzureChatOpenAI(
    openai_api_version=openai_api_version,
    azure_deployment=azure_deployment,
    azure_endpoint=azure_endpoint,
    openai_api_key=openai_api_key,
)

azure_openai = AzureOpenAI(model=custom_model)
print(azure_openai("Write me a joke"))
When creating a custom LLM evaluation model you should ALWAYS:
- inherit DeepEvalBaseLLM.
- implement the load_model() method, which will be responsible for returning a model object.
- implement the _call() method with one and only one parameter of type string that acts as the prompt to your custom LLM.
- return the final output string of your custom LLM from the _call() method. Note that we called chat_model.invoke(prompt).content to access the model output in this particular example, but this could be different depending on the implementation of your custom LLM object.
- implement the get_model_name() method, which simply returns a string representing the name of your LLM model.
Note that the model argument in the __init__() method can accept any type (the model string or object itself). Lastly, to use it for evaluation in LLM-based metrics:
from deepeval.metrics import AnswerRelevancyMetric
...
metric = AnswerRelevancyMetric(model=azure_openai)
While the Azure OpenAI command configures deepeval to use Azure OpenAI globally for all LLM-Evals, a custom LLM has to be set each time you instantiate a metric. Remember to provide your custom LLM instance through the model parameter for each metric you wish to use it for, as shown below.
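For example, if you want the same custom LLM for several LLM-Evals, pass it to each metric you instantiate (FaithfulnessMetric is shown here simply as another LLM-based metric from the list above):
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric

# The custom LLM must be supplied to every metric instance
answer_relevancy = AnswerRelevancyMetric(threshold=0.5, model=azure_openai)
faithfulness = FaithfulnessMetric(threshold=0.5, model=azure_openai)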
Mistral 7B Example
Here is an example of creating a custom Mistral 7B model through Hugging Face's transformers library for evaluation:
from transformers import AutoModelForCausalLM, AutoTokenizer
from deepeval.models.base_model import DeepEvalBaseLLM

class Mistral7B(DeepEvalBaseLLM):
    def __init__(
        self,
        model,
        tokenizer
    ):
        self.model = model
        self.tokenizer = tokenizer

    def load_model(self):
        return self.model

    def _call(self, prompt: str) -> str:
        model = self.load_model()

        device = "cuda"  # the device to load the model onto

        model_inputs = self.tokenizer([prompt], return_tensors="pt").to(device)
        model.to(device)

        generated_ids = model.generate(**model_inputs, max_new_tokens=100, do_sample=True)
        return self.tokenizer.batch_decode(generated_ids)[0]

    def get_model_name(self):
        return "Mistral 7B"

model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

mistral_7b = Mistral7B(model=model, tokenizer=tokenizer)
print(mistral_7b("Write me a joke"))
Note that for this particular implementation, we initialized our Mistral7B model with an additional tokenizer parameter, as this is required in the decoding step of the _call() method, unlike the AzureOpenAI example above. Lastly, to use it for evaluation in LLM-based metrics:
from deepeval.metrics import AnswerRelevancyMetric
...
metric = AnswerRelevancyMetric(model=mistral_7b)
You need to specify the custom evaluation model you would like to use through the model argument when instantiating an LLM-based metric.
AWS Bedrock Example
Here is an example of creating a custom AWS Bedrock model through the langchain_community.chat_models module for evaluation:
from langchain_community.chat_models import BedrockChat
from deepeval.models.base_model import DeepEvalBaseLLM

class AWSBedrock(DeepEvalBaseLLM):
    def __init__(
        self,
        model
    ):
        self.model = model

    def load_model(self):
        return self.model

    def _call(self, prompt: str) -> str:
        chat_model = self.load_model()
        return chat_model.invoke(prompt).content

    def get_model_name(self):
        return "Custom AWS Bedrock Model"

# Replace these with real values
custom_model = BedrockChat(
    credentials_profile_name=<your-profile-name>,  # e.g. "default"
    region_name=<your-region-name>,  # e.g. "us-east-1"
    endpoint_url=<your-bedrock-endpoint>,  # e.g. "https://bedrock-runtime.us-east-1.amazonaws.com"
    model_id=<your-model-id>,  # e.g. "anthropic.claude-v2"
    model_kwargs={"temperature": 0.4},
)

aws_bedrock = AWSBedrock(model=custom_model)
print(aws_bedrock("Write me a joke"))
Finally, supply the newly created aws_bedrock model to LLM-Evals:
from deepeval.metrics import AnswerRelevancyMetric
...
metric = AnswerRelevancyMetric(model=aws_bedrock)
Measuring a Metric
All metrics in deepeval, including custom metrics that you create:
- can be executed via the metric.measure() method
- can have their score accessed via metric.score
- can have their status accessed via metric.is_successful()
- can be used to evaluate test cases or entire datasets, with or without Pytest
- have a threshold that acts as the threshold for success. metric.is_successful() is only true if metric.score >= threshold.
In addition, most LLM-Evals in deepeval offer a reason for their score, which can be accessed via metric.reason.
Here's a quick example.
export OPENAI_API_KEY=<your-openai-api-key>
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase
# Initialize a test case
test_case = LLMTestCase(
    input="...",
    actual_output="...",
    retrieval_context=["..."]
)
# Initialize metric with threshold
metric = AnswerRelevancyMetric(threshold=0.5)
Using this metric, you can either execute it directly as a standalone evaluation to get its score and reason:
...
metric.measure(test_case)
print(metric.score)
print(metric.reason)
Or you can evaluate a test case using deepeval test run:
from deepeval import assert_test
...
def test_answer_relevancy():
    assert_test(test_case, [metric])
deepeval test run test_file.py
Or using the evaluate
function:
from deepeval import evaluate
...
evaluate([test_case], [metric])
For more details on how a metric evaluates a test case, refer to the test cases section.