Agentic evaluation run results

Release version: Yokohama

Updated July 31, 2025

Summary of Agentic Evaluation Run Results

Agentic evaluations assess the performance of AI agents and workflows by analyzing execution logs. The evaluation results page provides multiple metrics and scores that reflect task completeness and tool usage, enabling customers to gauge their agentic workflows' effectiveness. If an overall task completion evaluation is conducted, the results include recommended actions for improving or deploying the agentic workflows.

Key Features

Overall Score: Each evaluation method yields an overall score with a percentage of successful evaluations, categorized as Excellent, Good, Moderate, or Poor.
Customization: Users can adjust metric thresholds for each performance label by selecting "Customize metric thresholds."
Individual Record Evaluations: Each task is scored individually based on metrics assessing task completion, tool performance, and tool calling accuracy.

Key Outcomes

Excellent (90%-100%): The workflow is performing well; proceed with confidence.
Good (70%-89%): Some performance inconsistencies exist; deploy with caution.
Moderate (50%-69%): A significant number of tasks weren't fully completed; investigate the root causes of poor task completion.
Poor (0%-49%): Major issues detected; do not deploy.

By leveraging these insights, ServiceNow customers can enhance their AI agents' performance and ensure optimal workflow execution.

Learn about agentic evaluation runs and what the different scores on the agentic evaluation results page mean.

Agentic evaluations overview

Agentic evaluations measure how well AI agents and agentic workflows are accomplishing their objectives. A Now LLM Service model judges the AI agent or agentic workflow based on the execution logs. The results page of an evaluation run shows multiple metrics and scores measuring task completeness and tool use.

If you run an overall task completion evaluation, the results page shows recommended actions for the AI agent or agentic workflow. Recommended actions give you suggestions for deployment or improvement to help ensure that the agentic workflows that you deploy are performing up to your standards.

For more information on AI agent usage and other analytics, you can review the AI Agent Analytics dashboard in the AI Agent Studio.

Evaluation results overview

For each evaluation method that you execute, the results page displays an overall score for the agentic workflow with a percentage of successful record evaluations and a label of Excellent, Good, Moderate, or Poor. You can change the metric thresholds for each label by selecting Customize metric thresholds.

Table 1. Overall task completeness evaluation run results
Label	Description	Recommended action	Default threshold
Excellent	Tasks were consistently performed at a high standard. The agentic workflow is working well.	Proceed with confidence	90%–100%
Good	Most tasks were performed successfully, but some performance inconsistencies suggest areas for improvement.	Deploy with caution	70%–89%
Moderate	A significant number of tasks weren’t fully completed. Performance is below the desired level.	Investigate the root causes of poor task completion	50%–69%
Poor	The agentic workflow is consistently failing to complete tasks adequately. Major issues are present.	Do not deploy	0%–49%
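
The default thresholds in Table 1 can be read as a lookup from an overall percentage to a label and a recommended action. The following Python sketch illustrates that mapping only; it is not a ServiceNow API. The names `Band`, `DEFAULT_BANDS`, and `classify` are hypothetical, and treating a fractional value such as 89.5 as Good is an assumption, because the table lists whole-number ranges.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Band:
    label: str
    minimum: float  # inclusive lower bound, in percent
    action: str


# Default thresholds and recommended actions copied from Table 1.
# Hypothetical structure: the product stores thresholds in its own way.
DEFAULT_BANDS = (
    Band("Excellent", 90.0, "Proceed with confidence"),
    Band("Good", 70.0, "Deploy with caution"),
    Band("Moderate", 50.0, "Investigate the root causes of poor task completion"),
    Band("Poor", 0.0, "Do not deploy"),
)


def classify(percent, bands=DEFAULT_BANDS):
    """Return the band whose lower bound is the highest one not above percent."""
    if not 0.0 <= percent <= 100.0:
        raise ValueError("percent must be between 0 and 100")
    # Check bands from the highest lower bound down to the lowest.
    for band in sorted(bands, key=lambda b: b.minimum, reverse=True):
        if percent >= band.minimum:
            return band
    raise ValueError("no band covers this percentage")
```

Passing a different `bands` tuple mimics the effect of customized metric thresholds, for example raising the lower bound for Excellent to 95.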

Individual record metric scores

Evaluations are run against the log tables of agentic workflow executions. Each record is individually scored for each evaluation plan that you run. Individual record evaluations are scored according to the following metrics.

Table 2. Overall task completeness record metric scores
The overall task completeness metric assesses whether an AI agent successfully completes its assigned task. It evaluates the execution logs of the agent, ensuring that all required steps were taken and the task was logically and effectively completed.
Number	Score	Description
3	Successful	The main task was fully completed. All subtasks were resolved, and the steps followed a logical sequence with no critical errors.
2	Partially successful	The task was partially completed. Some subtasks remain unresolved or inefficiencies affected the process.
1	Unsuccessful	The task wasn’t completed. Critical subtasks were abandoned or unresolved or the execution failed entirely.
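
As a rough illustration of how the record scores in Table 2 could roll up into the overall percentage of successful record evaluations, the Python sketch below counts scores per label. It assumes that only a score of 3 (Successful) counts toward the percentage; the source doesn't state how partially successful records are weighted. The names `SCORE_NAMES` and `summarize` are hypothetical.

```python
from collections import Counter

# Numeric scores and labels from Table 2.
SCORE_NAMES = {3: "Successful", 2: "Partially successful", 1: "Unsuccessful"}


def summarize(scores):
    """Count records per label and compute the share of fully successful ones."""
    if not scores:
        raise ValueError("at least one record score is required")
    unknown = set(scores) - set(SCORE_NAMES)
    if unknown:
        raise ValueError(f"unexpected scores: {sorted(unknown)}")
    counts = Counter(scores)
    return {
        "counts": {name: counts.get(number, 0) for number, name in SCORE_NAMES.items()},
        # Assumption: only score 3 counts as a successful record evaluation.
        "success_percent": 100.0 * counts.get(3, 0) / len(scores),
    }
```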

Table 3. Tool performance record metric scores
The tool performance evaluation metric assesses an AI agent’s ability to select the most appropriate tool for each step while completing a task.
Number	Score	Description
1	True	The right tool was chosen for the action in the plan.
0	False	The right tool wasn’t chosen.

Table 4. Tool calling record metric scores
The tool calling evaluation metric assesses whether an AI agent correctly constructs tool calls by validating the accuracy, completeness, and formatting of the inputs it provides.
Number	Score	Description
1	True	Input key completeness, input value completeness, and input format completeness were successful.
0	False	One or more of input key completeness, input value completeness, or input format completeness wasn’t successful.
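
The scoring rule in Table 4 is all-or-nothing: a record scores 1 only when the input key, input value, and input format checks all succeed. The Python sketch below shows that rule with a simple deterministic check. It is not how the Now LLM Service model judges a tool call, which the source doesn't describe. The `schema` shape (parameter name mapped to an expected Python type), the example parameter names, and the function `score_tool_call` are all hypothetical.

```python
def score_tool_call(inputs, schema):
    """Return (score, checks), where score is 1 only if every check passes."""
    # Key completeness: every expected parameter name is present.
    keys_ok = all(name in inputs for name in schema)
    # Value completeness: no expected parameter is None or an empty string.
    values_ok = keys_ok and all(inputs[name] not in (None, "") for name in schema)
    # Format completeness: every value has the expected type.
    format_ok = values_ok and all(
        isinstance(inputs[name], expected) for name, expected in schema.items()
    )
    checks = {"keys": keys_ok, "values": values_ok, "format": format_ok}
    return (1 if all(checks.values()) else 0), checks


# Hypothetical tool schema and tool call, for illustration only.
schema = {"incident_id": str, "priority": int}
score, checks = score_tool_call({"incident_id": "INC001", "priority": 2}, schema)
```

Returning the individual checks alongside the score makes it easier to see which of the three conditions caused a record to score 0.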