General guidelines for agentic evaluation runs

  • Release version: Xanadu
  • Updated April 8, 2025
  • Learn about agentic evaluation runs and review recommendations for evaluating your agentic workflows against datasets to check task completion, performance, and tool execution.

    Overview of agentic evaluation runs

    Agentic evaluation runs measure agentic workflow executions against metrics such as task completion, performance, and tool execution. You can create the datasets for these runs from agentic workflow logs.

    When to run agentic evaluations

    Run after you have collected enough data.
    Evaluation runs are measured against logs of agentic workflow activity on your instance, so wait until the workflow has generated enough logs to evaluate.
    Run agentic evaluations when you make significant changes.
    After you update an agentic workflow, you can execute an agentic evaluation run to measure how effective the new version is.
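As a rough illustration of tracking efficacy across versions, the sketch below compares a task completion rate before and after a change. It is not the product's scoring logic: the `completion_rate` helper and the boolean result lists are hypothetical stand-ins for whatever an evaluation run reports.

```python
def completion_rate(results):
    """Return the fraction of executions marked complete.

    `results` is a hypothetical list of booleans, one per evaluated execution.
    """
    if not results:
        raise ValueError("no executions to evaluate")
    return sum(results) / len(results)


# Made-up outcomes for the same workflow before and after an update.
before = [True, False, True, False]
after = [True, True, True, False]

delta = completion_rate(after) - completion_rate(before)
print(f"{completion_rate(before):.2f} -> {completion_rate(after):.2f} (change {delta:+.2f})")
```

A positive change suggests the update helped on this metric. Comparing several metrics side by side gives a fuller picture than any single number.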

    Choosing an evaluation method

    Review the evaluation method options.
    The agentic evaluation Guided Setup describes each evaluation method, including what it measures and how it works. You can also review the common questions in the sidebar for answers about the available metrics.
    Use multiple evaluation methods at a time.
    Choosing multiple evaluation methods can provide a better overall picture of the agentic workflow's performance.

    Creating a dataset

    Use filters to target the right data.
    Add filters to the execution logs to control exactly what you're measuring your agentic workflow against. Filter by time frame to verify that you're measuring the latest version of a workflow. Select See preview to see a list of records, and use the check boxes to select individual records to measure against.
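The filtering idea can be sketched in code. The example below is only a conceptual model, not the product's API: the `ExecutionLog` record, its fields, and the `filter_logs` helper are all hypothetical. It keeps executions that started inside a time window and, optionally, only the records that were individually selected.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ExecutionLog:
    """Hypothetical stand-in for one agentic workflow execution log record."""
    record_id: str
    started_at: datetime


def filter_logs(logs, start, end, selected_ids=None):
    """Keep logs that started in [start, end).

    If `selected_ids` is given, keep only those records, mirroring the idea
    of ticking individual check boxes in a preview list.
    """
    return [
        log for log in logs
        if start <= log.started_at < end
        and (selected_ids is None or log.record_id in selected_ids)
    ]


logs = [
    ExecutionLog("rec-1", datetime(2025, 3, 1, tzinfo=timezone.utc)),
    ExecutionLog("rec-2", datetime(2025, 4, 2, tzinfo=timezone.utc)),
    ExecutionLog("rec-3", datetime(2025, 4, 5, tzinfo=timezone.utc)),
]

# Assume the latest workflow version went live on April 1, so earlier
# executions are excluded from the dataset.
window_start = datetime(2025, 4, 1, tzinfo=timezone.utc)
window_end = datetime(2025, 4, 8, tzinfo=timezone.utc)

preview = filter_logs(logs, window_start, window_end)
dataset = filter_logs(logs, window_start, window_end, selected_ids={"rec-2"})
print([log.record_id for log in preview])
print([log.record_id for log in dataset])
```

Narrowing the window to the period after the last update keeps executions of older versions out of the dataset.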