Run an agentic evaluation

  • Release version: Xanadu
  • Updated April 3, 2025

  Run an evaluation of an agentic workflow against a dataset of your choice to monitor its performance and compare it against different benchmarks.

    Before you begin

    Evaluation runs require execution log data of the agentic workflow you want to evaluate. For a new agentic workflow, you can create execution logs by testing in AI Agent Studio. For more information about testing agentic workflows, see Test an agentic workflow.

    For more information about getting started with agentic evaluations, see General guidelines for agentic evaluation runs.

    Role required: sn_aia.admin

    Procedure

    1. Navigate to All > Now Assist Skill Kit > Agentic Evaluations.

      You can also start from the testing page of AI Agent Studio. Navigate to All > AI Agent Studio > Testing, select an agentic workflow, and then select Set up evaluation run. A dialog asks whether you want to be redirected to Now Assist Skill Kit. Select Open Skill Kit to go to the guided setup.

    2. On the evaluations home page, select New evaluation run to begin the guided setup.
    3. In the Add general info step, add a name and select the agentic workflow that you want to evaluate.

      Figure: The Add general info step of the guided setup, showing a name, a description, and Categorize Incident as the selected agentic workflow.

    4. Select Continue to go to the next step.

      Each time you navigate through a step, the evaluation run is saved automatically as a draft. At any point, you can select Save as draft.

      To exit the guided setup, select Exit setup and choose one of the following options. You're then redirected to the Agentic Evaluations page.

      • If you select Save and exit, the evaluation run appears in the list on the Agentic Evaluations page with the status of Draft.
      • If you select Discard and exit, the evaluation run draft is deleted.
    5. Select your evaluation method.

      Overall task completeness evaluation is selected by default. Running multiple evaluation methods at a time can help provide a more comprehensive overview of the agentic workflow's performance.

      To see more information about an evaluation method, expand its card by selecting the chevron icon.

      Figure: The evaluation method step of the guided setup, with four options selected (overall task completeness, plan, tool performance, and tool calling).

    6. Choose your dataset.
      1. Select an existing dataset or create your own.
      2. To create a new dataset, fill out the form.
        Table 1. Choose a dataset form

        Field name | Description
        Name | Name of the dataset.
        Description | General description of the dataset and its intended purpose.
        Max records (optional) | The maximum number of records in the dataset to run the evaluation on. If the dataset contains more records than this maximum, the records beyond the maximum are ignored for that evaluation run.
        Filters | Conditions for narrowing down the AI execution log records that you want to include in the dataset. By default, the agentic workflow that you're evaluating is selected as a filter condition.

        Figure: The dataset form for the Categorize Incident agentic evaluation, with no maximum records and a filter where the usecase field is categorize incident.

      3. Select See preview to see a list of records based on the conditions you specified.
        You can narrow down the records further by only selecting some of the records in the preview list. Unselected records won’t be included in the dataset.
    7. Review the agentic evaluation details in the last step of the guided setup.

      To make changes, select Back to return to a previous step, or select the step in the sidebar.

      Figure: The review page of the guided setup for the Categorize Incident agentic evaluation.

    8. Select Start evaluation.
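
    The way the dataset options in step 6 combine can be pictured with a short sketch. This is only an illustration in plain Python, not the platform's implementation or API: the ExecutionLog class, the build_dataset function, and the sample record IDs are all hypothetical, and the order in which the preview selection and the Max records limit are applied is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Iterable, List, Optional


@dataclass(frozen=True)
class ExecutionLog:
    """Hypothetical stand-in for one AI execution log record."""
    record_id: str
    usecase: str


def build_dataset(
    logs: Iterable[ExecutionLog],
    condition: Callable[[ExecutionLog], bool],
    deselected_ids: FrozenSet[str] = frozenset(),
    max_records: Optional[int] = None,
) -> List[ExecutionLog]:
    """Narrow execution logs the way the dataset form describes (illustrative only)."""
    # Filters: keep only the logs that match the filter conditions.
    matching = [log for log in logs if condition(log)]
    # Preview: records left unselected in the preview are not included.
    selected = [log for log in matching if log.record_id not in deselected_ids]
    # Max records: anything past the maximum is ignored for the run.
    # (Applying this after the preview selection is an assumption.)
    if max_records is not None:
        selected = selected[:max_records]
    return selected


# Made-up sample records for the illustration.
logs = [
    ExecutionLog("LOG001", "categorize incident"),
    ExecutionLog("LOG002", "summarize case"),
    ExecutionLog("LOG003", "categorize incident"),
    ExecutionLog("LOG004", "categorize incident"),
]

dataset = build_dataset(
    logs,
    condition=lambda log: log.usecase == "categorize incident",
    deselected_ids=frozenset({"LOG003"}),
    max_records=1,
)
print([log.record_id for log in dataset])
```

    With these sample values, the filter keeps three records, the preview selection drops one, and the Max records limit of 1 leaves a single record for the evaluation run. Leaving max_records unset keeps every selected record.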

    Result

    Your evaluation run starts. The time an evaluation run takes to complete varies. After it completes, select the evaluation run on the Agentic Evaluations page to view the results.

    For more information on the metrics on the results page, see Agentic evaluation run results.