Agentic evaluation run results

  • Release version: Australia
  • Updated March 25, 2026
  • 3 minutes to read

    Learn about agentic evaluation runs and the meaning behind different evaluation scores from the agentic evaluation results page.

    Agentic evaluations overview

    Agentic evaluations measure how well AI agents and agentic workflows are accomplishing their objectives. A Now LLM Service model judges the AI agent or agentic workflow based on the execution logs. The results page of an evaluation run shows multiple metrics and scores measuring task completeness and tool use.

    If you run an overall task completeness evaluation, the results page shows recommended actions for the AI agent or agentic workflow. Recommended actions suggest whether to deploy or improve the agentic workflow, which helps you verify that the workflows you deploy meet your standards.

    After you've reviewed your evaluation results, you can archive your evaluation or copy it to run another evaluation with the same parameters and dataset.

    You can export the evaluation results as a report. The report is formatted as a .csv file that includes the individual sys_ids of the execution records and the metric scores for each.
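    As a rough illustration of working with an exported report, the sketch below parses CSV text into one dictionary per execution record. The column names (sys_id, overall_task_completeness, tool_performance, tool_calling) and the sample rows are hypothetical placeholders, because the exact headers of the export are not listed here. Adjust them to match the file that you download.

```python
import csv
import io

# Hypothetical export contents. The real column headers and sys_id values
# will differ; these rows exist only to make the sketch runnable.
SAMPLE_REPORT = """sys_id,overall_task_completeness,tool_performance,tool_calling
a1b2c3,3,1,1
d4e5f6,2,1,0
g7h8i9,1,0,0
"""


def load_report(csv_text):
    """Return one dict per execution record, with metric scores parsed as ints."""
    records = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        record = {"sys_id": row["sys_id"]}
        for column, value in row.items():
            if column != "sys_id":
                record[column] = int(value)
        records.append(record)
    return records


records = load_report(SAMPLE_REPORT)
```

    To read a downloaded file instead of the inline sample, pass the file contents as text to load_report.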

    For more information on AI agent usage and other analytics, you can review the AI Agent Analytics dashboard in the AI Agent Studio.

    Evaluation results overview

    For each evaluation method that you execute, the results page displays an overall score for the agentic workflow with a percentage of successful record evaluations and a label of Excellent, Good, Moderate, or Poor. You can change the metric thresholds for each label by selecting Customize metric thresholds.

    In addition to the overall task completeness results, you can review a summary of the results of the other metrics.

    Table 1. Overall task completeness evaluation run results

    | Label | Description | Recommended action | Default threshold |
    |-------|-------------|--------------------|-------------------|
    | Excellent | Tasks were consistently performed at a high standard. The agentic workflow or AI agent is working well. | Proceed with confidence | 90%–100% |
    | Good | Most tasks were performed successfully, but some performance inconsistencies suggest areas for improvement. | Deploy with caution | 70%–89% |
    | Moderate | A significant number of tasks weren't fully completed. Performance is below the desired level. | Investigate the root causes of poor task completion | 50%–69% |
    | Poor | The agentic workflow is consistently failing to complete tasks adequately. Major issues are present. | Do not deploy | 0%–49% |
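    The default thresholds can be read as a simple lookup from an overall percentage to a label and recommended action. The following is a minimal sketch of that mapping, not the product's implementation. The function name label_for_score is hypothetical, and the sketch assumes that each band starts at its lower bound, so a fractional score such as 89.5% falls into Good. If you customize the metric thresholds, pass your own list.

```python
# Default thresholds, as (minimum percentage, label, recommended action),
# ordered from the highest band to the lowest.
DEFAULT_THRESHOLDS = [
    (90, "Excellent", "Proceed with confidence"),
    (70, "Good", "Deploy with caution"),
    (50, "Moderate", "Investigate the root causes of poor task completion"),
    (0, "Poor", "Do not deploy"),
]


def label_for_score(percentage, thresholds=DEFAULT_THRESHOLDS):
    """Return (label, recommended action) for an overall score percentage.

    Assumption: a band applies when the score is at or above its minimum.
    """
    if not 0 <= percentage <= 100:
        raise ValueError("percentage must be between 0 and 100")
    for minimum, label, action in thresholds:
        if percentage >= minimum:
            return label, action
    raise ValueError("thresholds must include a band that starts at 0")
```

    For example, label_for_score(72) returns the Good label with the Deploy with caution action.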

    Individual record metric scores

    Evaluations are run against the log tables of agentic workflow executions. Each record is individually scored for each evaluation plan that you run. Individual record evaluations are scored according to the following metrics.

    Table 2. Overall task completeness record metric scores

    The overall task completeness metric assesses whether an AI agent successfully completes its assigned task. It evaluates the execution logs of the agent, ensuring that all required steps were taken and the task was logically and effectively completed.

    | Number | Score | Description |
    |--------|-------|-------------|
    | 3 | Successful | The main task was fully completed. All subtasks were resolved, and the steps followed a logical sequence with no critical errors. |
    | 2 | Partially successful | The task was partially completed. Some subtasks remain unresolved or inefficiencies affected the process. |
    | 1 | Unsuccessful | The task wasn't completed. Critical subtasks were abandoned or unresolved or the execution failed entirely. |
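    To connect the record scores to the overall score, the sketch below computes the percentage of successful record evaluations from a list of per-record scores. It assumes that only a score of 3 (Successful) counts toward the percentage and that no rounding is applied. Both are assumptions made for illustration, and overall_score is a hypothetical name.

```python
SUCCESSFUL = 3  # Record score for a fully completed task


def overall_score(record_scores):
    """Return the percentage of records scored as Successful.

    Assumption: scores of 2 (Partially successful) and 1 (Unsuccessful)
    do not count as successful record evaluations.
    """
    if not record_scores:
        raise ValueError("at least one record score is required")
    successful = sum(1 for score in record_scores if score == SUCCESSFUL)
    return 100 * successful / len(record_scores)
```

    With four records scored 3, 3, 2, and 1, two of the four are successful, so overall_score returns 50.0.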

    Table 3. Tool performance record metric scores

    The tool performance evaluation metric assesses an AI agent's ability to select the most appropriate tool for each step while completing a task.

    | Number | Score | Description |
    |--------|-------|-------------|
    | 1 | True | The right tool was chosen for the action in the plan. |
    | 0 | False | The right tool wasn't chosen. |

    Table 4. Tool calling records metric scores

    The tool calling evaluation metric assesses whether an AI agent correctly constructs tool calls by validating the accuracy, completeness, and formatting of the inputs it provides.

    | Number | Score | Description |
    |--------|-------|-------------|
    | 1 | True | Input key completeness, input value correctness, and input format correctness are all successful. |
    | 0 | False | One or more of input key completeness, input value correctness, or input format correctness wasn't successful. |

    Each sub-metric is scored as follows:

    | Sub-metric | 1 - True | 0 - False |
    |------------|----------|-----------|
    | Input key completeness | All required parameters are present with exact name matches, and no unexpected parameters are included. | A mandatory parameter is missing, its name doesn't match exactly, or an unexpected parameter was found. |
    | Input value correctness | Tool input values are correctly mapped. | Tool input values are not correctly mapped. |
    | Input format correctness | Tool inputs are in the correct format. | Tool inputs are not in the correct format. |

    Note:
    The values of the sub-metrics are aggregated using an AND operator. If any one value is 0, the entire tool calling records metric score is 0.
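    The AND aggregation described in the note can be sketched in a few lines. The function and parameter names here are hypothetical and chosen only to mirror the three sub-metrics. Each argument is expected to be 0 or 1.

```python
def tool_calling_score(key_completeness, value_correctness, format_correctness):
    """Combine the three sub-metric scores with a logical AND.

    Returns 1 only when every sub-metric is 1. Any 0 makes the result 0.
    """
    sub_metrics = (key_completeness, value_correctness, format_correctness)
    if any(value not in (0, 1) for value in sub_metrics):
        raise ValueError("each sub-metric must be 0 or 1")
    return int(all(sub_metrics))
```

    For example, a tool call with complete keys and correct formatting but an incorrectly mapped value, tool_calling_score(1, 0, 1), scores 0.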