Reference for agentic evaluations
Find technical reference material for roles, metrics, and output formats of agentic evaluations.
Available metrics
| Metric | What it measures | Ground truth required |
|---|---|---|
| Task completeness | Whether the agentic AI asset fully addresses the user need | Optional |
| Response accuracy | Whether the agentic AI asset's response is factually accurate | Recommended |
| Groundedness | Whether the agentic AI asset's response is grounded in the specific context of the task | No |
| Coherence | Whether the agentic AI asset's response is logically structured and clear | No |
| Tool use accuracy | Whether the agentic AI asset selected and used the correct tool to execute its tasks | Optional |
| Goal adherence | Whether the agentic AI asset stayed within its defined scope and instructions | No |
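
When planning a run, it can help to encode the metrics table as data so that a script can tell which metrics make use of ground truth. The following is a minimal sketch only: the `Metric` class, the `METRICS` list, and the `metrics_using_ground_truth` helper are hypothetical names chosen for illustration and are not part of any product API. The values are transcribed from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metric:
    name: str
    ground_truth: str  # "Optional", "Recommended", or "No", as listed in the table


# Values transcribed from the metrics table; the structure itself is hypothetical.
METRICS = [
    Metric("Task completeness", "Optional"),
    Metric("Response accuracy", "Recommended"),
    Metric("Groundedness", "No"),
    Metric("Coherence", "No"),
    Metric("Tool use accuracy", "Optional"),
    Metric("Goal adherence", "No"),
]


def metrics_using_ground_truth(metrics=METRICS):
    """Return the names of metrics whose ground truth entry is not "No"."""
    return [m.name for m in metrics if m.ground_truth != "No"]


if __name__ == "__main__":
    for name in metrics_using_ground_truth():
        print(name)
```

A helper like this could be used to decide whether a dataset without a ground truth field is enough for the metrics selected for a run.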
Issue types
Issues are categorized by the behavior of the agentic AI asset and are identified separately for each metric.
| Category | Agentic AI asset behavior |
|---|---|
| Incomplete response | Response failed to address the user's full request |
| Factual error | Response contained content that isn't factually correct |
| Hallucination | Response contained content not grounded in the specific context of the request |
| Incoherent output | Response was disorganized or difficult to understand |
| Incorrect tool use | Selected the wrong tool or passed incorrect parameters to a tool |
| Scope violation | Responded to a request outside its defined operating scope |
Data requirements
| Requirement | Description |
|---|---|
| Minimum test cases | A minimum number of test cases is required per run. The specific metrics you are using for the run may have their own minimum test cases. Ensure that your dataset meets the requirements for all metrics. |
| Supported formats | CSV and structured JSON are supported. |
| Ground truth field | If you're using a ground truth, it must be provided as a separate field in the dataset. The ground truth field must be aligned to each test case individually. |
| Data representativeness | Datasets should reflect all of the tasks that the AI agent or agentic workflow will handle. Include edge cases and failure-prone scenarios to help ensure that you're testing against common real-world scenarios. |
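
The data requirements can be checked before a run with a small pre-flight script. The sketch below is illustrative and makes several assumptions: the field names `input` and `ground_truth`, the functions `load_test_cases` and `validate_dataset`, and the `min_cases` threshold are all hypothetical, because this reference does not specify field names or the actual minimum. It also assumes that a JSON dataset is an array of objects, one per test case.

```python
import csv
import io
import json


def load_test_cases(text, fmt):
    """Parse CSV or JSON text into a list of dicts, one per test case."""
    if fmt == "csv":
        return list(csv.DictReader(io.StringIO(text)))
    if fmt == "json":
        data = json.loads(text)
        # Assumption: the JSON dataset is an array of objects.
        if not isinstance(data, list):
            raise ValueError("expected a JSON array of test cases")
        return data
    raise ValueError(f"unsupported format: {fmt}")


def validate_dataset(cases, min_cases, ground_truth_field=None):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    if len(cases) < min_cases:
        problems.append(
            f"only {len(cases)} test cases; at least {min_cases} required"
        )
    if ground_truth_field is not None:
        # Ground truth must be aligned to each test case individually,
        # so every row needs its own non-empty value.
        for i, case in enumerate(cases):
            value = case.get(ground_truth_field)
            if value is None or str(value).strip() == "":
                problems.append(
                    f"test case {i} has no '{ground_truth_field}' value"
                )
    return problems


if __name__ == "__main__":
    sample = "input,ground_truth\nWhat is 2+2?,4\nName a prime.,\n"
    cases = load_test_cases(sample, "csv")
    for problem in validate_dataset(cases, min_cases=3,
                                    ground_truth_field="ground_truth"):
        print(problem)
```

Pass the largest minimum among the metrics selected for the run as `min_cases`, so that the dataset satisfies all of them. Representativeness cannot be checked mechanically this way and still needs manual review.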