Reference for agentic evaluations

  • Release version: Australia
  • Updated March 18, 2026
  • Find technical reference material for roles, metrics, and output formats of agentic evaluations.

    Available metrics

    Table 1. Standard metrics available

    | Metric | What it measures | Ground truth required |
    | --- | --- | --- |
    | Task completeness | Whether the agentic AI asset fully addresses the user need | Optional |
    | Response accuracy | Whether the agentic AI asset's response is factually accurate | Recommended |
    | Groundedness | Whether the agentic AI asset's response is grounded in the specific context of the task | No |
    | Coherence | Whether the agentic AI asset's response is logically structured and clear | No |
    | Tool use accuracy | Whether the agentic AI asset selected and used the correct tool to execute its tasks | Optional |
    | Goal adherence | Whether the agentic AI asset stayed within its defined scope and instructions | No |
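
The ground truth column of Table 1 can be encoded as a lookup to check, before a run, which of the selected metrics can make use of ground truth. This is a minimal sketch: the enum, dictionary, and function names are hypothetical and are not part of any product API.

```python
from enum import Enum


class GroundTruth(Enum):
    """Ground truth requirement levels, as listed in Table 1."""
    NO = "No"
    OPTIONAL = "Optional"
    RECOMMENDED = "Recommended"


# Hypothetical lookup transcribed from Table 1; not a product API.
METRIC_GROUND_TRUTH = {
    "Task completeness": GroundTruth.OPTIONAL,
    "Response accuracy": GroundTruth.RECOMMENDED,
    "Groundedness": GroundTruth.NO,
    "Coherence": GroundTruth.NO,
    "Tool use accuracy": GroundTruth.OPTIONAL,
    "Goal adherence": GroundTruth.NO,
}


def metrics_using_ground_truth(selected):
    """Return the selected metrics that can use ground truth, in input order."""
    return [m for m in selected if METRIC_GROUND_TRUTH[m] is not GroundTruth.NO]
```

A caller could use the returned list to decide whether the dataset needs a ground truth field at all for the metrics chosen.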

    Issue types

    Issues are categorized by the behavior that caused them. Issues are identified separately for each metric.

    Table 2. Issue categories

    | Category | Agentic AI asset behavior |
    | --- | --- |
    | Incomplete response | Response failed to address the user's full request |
    | Factual error | Response contained content that isn't factually correct |
    | Hallucination | Response contained content not grounded in the specific context of the request |
    | Incoherent output | Response was disorganized or difficult to understand |
    | Incorrect tool use | Selected the wrong tool or passed incorrect parameters to a tool |
    | Scope violation | Responded to a request outside its defined operating scope |
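
Because issues are identified separately for each metric, a summary of results might count issues per metric and category. The sketch below assumes a hypothetical list of issue records with `metric` and `category` keys; the actual output format of an evaluation run is not described in this section, so treat the record shape and the function name as illustrative only.

```python
from collections import Counter

# Issue categories transcribed from Table 2.
ISSUE_CATEGORIES = {
    "Incomplete response",
    "Factual error",
    "Hallucination",
    "Incoherent output",
    "Incorrect tool use",
    "Scope violation",
}


def count_issues(issues):
    """Count issues per (metric, category) pair.

    `issues` is a hypothetical list of dicts with "metric" and "category" keys.
    Raises ValueError for a category that is not listed in Table 2.
    """
    counts = Counter()
    for issue in issues:
        if issue["category"] not in ISSUE_CATEGORIES:
            raise ValueError(f"Unknown issue category: {issue['category']}")
        counts[(issue["metric"], issue["category"])] += 1
    return counts
```

Keying the counter on the pair keeps each metric's issues separate, even if the same category were reported under more than one metric.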

    Data requirements

    Table 3. Data requirements for datasets in agentic evaluations

    | Requirement | Description |
    | --- | --- |
    | Minimum test cases | A minimum number of test cases is required per run. The specific metrics you are using for the run may have their own minimum test cases. Ensure that your dataset meets the requirements for all metrics. |
    | Supported formats | CSV and structured JSON are supported. |
    | Ground truth field | If you're using a ground truth, it must be provided as a separate field in the dataset. The ground truth field must be aligned to each test case individually. |
    | Data representativeness | Datasets should reflect all of the tasks that the AI agent or agentic workflow will handle. Include edge cases and failure-prone scenarios to help ensure that you're testing against common real-world scenarios. |
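
The requirements in Table 3 lend themselves to a pre-flight check on a dataset before a run. The following is a minimal sketch using only the Python standard library. The function names, the `min_cases` threshold, and the ground truth field name are assumptions supplied by the caller, since this section does not state a specific minimum or field name, and the JSON layout (an array of objects) is likewise assumed.

```python
import csv
import io
import json


def load_test_cases(text, fmt):
    """Parse dataset text in one of the two supported formats into a list of dicts."""
    if fmt == "csv":
        return list(csv.DictReader(io.StringIO(text)))
    if fmt == "json":
        data = json.loads(text)
        # Assumption: structured JSON means an array of test case objects.
        if not isinstance(data, list):
            raise ValueError("Expected a JSON array of test cases")
        return data
    raise ValueError(f"Unsupported format: {fmt}")


def validate_dataset(cases, min_cases, ground_truth_field=None):
    """Return a list of problems found; an empty list means the checks passed.

    `min_cases` and `ground_truth_field` are caller-supplied assumptions.
    """
    problems = []
    if len(cases) < min_cases:
        problems.append(
            f"Dataset has {len(cases)} test cases; at least {min_cases} required"
        )
    if ground_truth_field is not None:
        # Ground truth must be present for each test case individually.
        for index, case in enumerate(cases):
            value = case.get(ground_truth_field)
            if value is None or str(value).strip() == "":
                problems.append(f"Test case {index} is missing '{ground_truth_field}'")
    return problems
```

When several metrics are selected, pass the largest of their minimums as `min_cases` so that the dataset meets the requirements for all of them. Data representativeness is not something a check like this can verify; it still needs a manual review of the test cases.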