JudgementalGPT
JudgementalGPT is an LLM agent developed in-house by Confident AI that's dedicated to evaluation and is superior to GEval. While it operates similarly to GEval by utilizing LLMs for scoring, it:
- offers enhanced accuracy and reliability
- is capable of generating justifications in different languages
- has the ability to conditionally execute code that helps detect logical fallacies during evaluations
Required Arguments
To use JudgementalGPT, you'll have to provide the following arguments when creating an LLMTestCase:
- input
- actual_output
Similar to GEval, you'll also need to supply any additional arguments, such as expected_output and context, if your evaluation criteria depend on these parameters.
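For illustration, here is a sketch of a test case that supplies the two required parameters alongside the optional ones (the values are purely illustrative):
from deepeval.test_case import LLMTestCase

# input and actual_output are required; expected_output and context are
# optional and only matter if your evaluation criteria reference them
test_case = LLMTestCase(
    input="Show me a valid json",
    actual_output="{'valid': 'json'}",
    expected_output='{"valid": "json"}',
    context=["Valid JSON must use double quotes"]
)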
Example
To use JudgementalGPT, start by logging into Confident AI:
deepeval login
Then paste in the following code to define a metric powered by JudgementalGPT:
from deepeval.types import Languages
from deepeval.metrics import JudgementalGPT
from deepeval.test_case import LLMTestCaseParams
code_correctness_metric = JudgementalGPT(
    name="Code Correctness",
    criteria="Code Correctness - determine whether the code in the 'actual output' produces a valid JSON.",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
    language=Languages.SPANISH,
    threshold=0.5,
)
Under the hood, the metric sends a request to Confident AI's servers, which host JudgementalGPT. JudgementalGPT accepts the following arguments:
- name: the name of the metric.
- criteria: a description outlining the specific evaluation aspects for each test case.
- evaluation_params: a list of type LLMTestCaseParams. Include only the parameters that are relevant for evaluation (see the sketch after this list).
- [Optional] language: of type Languages, specifies what language to return the reasoning in.
- [Optional] threshold: the passing threshold, defaulted to 0.5.
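For instance, if your criteria compare the actual output against a reference answer, a metric along these lines (the name and criteria here are illustrative, not part of the library) would also include EXPECTED_OUTPUT in evaluation_params:
from deepeval.metrics import JudgementalGPT
from deepeval.test_case import LLMTestCaseParams

# EXPECTED_OUTPUT is included only because the criteria reference it
answer_match_metric = JudgementalGPT(
    name="Answer Match",
    criteria="Determine whether the 'actual output' conveys the same answer as the 'expected output'.",
    evaluation_params=[
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT,
    ],
    threshold=0.5,
)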
Similar to GEval, you can access the judgemental score and reason for JudgementalGPT:
from deepeval.test_case import LLMTestCase
...
test_case = LLMTestCase(
    input="Show me a valid json",
    actual_output="{'valid': 'json'}"
)
code_correctness_metric.measure(test_case)
print(code_correctness_metric.score)
print(code_correctness_metric.reason)
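If you'd rather score several test cases at once, you can also pass the metric to deepeval's evaluate function; the sketch below assumes the evaluate helper exposed by the version of deepeval you have installed:
from deepeval import evaluate

# Runs the metric over every test case and reports pass/fail against the 0.5 threshold
evaluate([test_case], [code_correctness_metric])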