JudgementalGPT
JudgementalGPT is an LLM agent developed in-house by Confident AI that's dedicated to evaluation and is superior to GEval. While it operates similarly to GEval by utilizing LLMs for scoring, it:
- offers enhanced accuracy and reliability
- is capable of generating justifications in different languages
- has the ability to conditionally execute code that helps detect logical fallacies during evaluations
Required Arguments
To use JudgementalGPT, you'll have to provide the following arguments when creating an LLMTestCase:
- input
- actual_output
Similar to GEval, you'll also need to supply any additional arguments, such as expected_output and context, if your evaluation criteria depend on these parameters.
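For illustration, here is a sketch of a test case that supplies the two required parameters alongside the optional ones (the values are purely illustrative):
from deepeval.test_case import LLMTestCase

# input and actual_output are required; expected_output and context are
# optional and only matter if your evaluation criteria reference them
test_case = LLMTestCase(
    input="Show me a valid json",
    actual_output="{'valid': 'json'}",
    expected_output='{"valid": "json"}',
    context=["Valid JSON must use double quotes"]
)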
Example
To use JudgementalGPT, start by logging into Confident AI:
deepeval login
Then paste in the following code to define a metric powered by JudgementalGPT:
from deepeval.types import Languages
from deepeval.metrics import JudgementalGPT
from deepeval.test_case import LLMTestCaseParams
code_correctness_metric = JudgementalGPT(
    name="Code Correctness",
    criteria="Code Correctness - determine whether the code in the 'actual output' produces a valid JSON.",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
    language=Languages.SPANISH,
    threshold=0.5,
)
Under the hood, the metric sends a request to Confident AI's servers, which host JudgementalGPT. JudgementalGPT accepts the following arguments:
- name: the name of the metric.
- criteria: a description outlining the specific evaluation aspects for each test case.
- evaluation_params: a list of type LLMTestCaseParams. Include only the parameters that are relevant for evaluation (see the sketch after this list).
- [Optional] language: of type Languages, specifies what language to return the reasoning in.
- [Optional] threshold: the passing threshold, defaulted to 0.5.
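For instance, if your criteria compare the actual output against a reference answer, a metric along these lines (the name and criteria here are illustrative, not part of the library) would also include EXPECTED_OUTPUT in evaluation_params:
from deepeval.metrics import JudgementalGPT
from deepeval.test_case import LLMTestCaseParams

# EXPECTED_OUTPUT is included only because the criteria reference it
answer_match_metric = JudgementalGPT(
    name="Answer Match",
    criteria="Determine whether the 'actual output' conveys the same answer as the 'expected output'.",
    evaluation_params=[
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT,
    ],
    threshold=0.5,
)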
Similar to GEval, you can access the judgemental score and reason for JudgementalGPT:
from deepeval.test_case import LLMTestCase
...
test_case = LLMTestCase(
    input="Show me a valid json",
    actual_output="{'valid': 'json'}"
)
code_correctness_metric.measure(test_case)
print(code_correctness_metric.score)
print(code_correctness_metric.reason)
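If you'd rather score several test cases at once, you can also pass the metric to deepeval's evaluate function; the sketch below assumes the evaluate helper exposed by the version of deepeval you have installed:
from deepeval import evaluate

# Runs the metric over every test case and reports pass/fail against the 0.5 threshold
evaluate([test_case], [code_correctness_metric])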