Agentic evaluation run results
Agentic evaluations in ServiceNow measure how effectively AI agents and agentic workflows achieve their objectives by analyzing execution logs. A Now LLM Service model assesses these agents or workflows and provides multiple metrics and scores related to task completeness and tool use. This helps customers verify that their AI agents perform to expected standards before deployment.
Evaluation results are accessible on a results page, which includes overall scores, detailed metric summaries, and recommended actions. Customers can archive evaluations, rerun them with the same parameters and data, or export results as CSV reports containing execution record IDs and metric scores. For broader insights, users can consult the AI Agent Analytics dashboard in AI Agent Studio.
Key Features
- Overall task completeness evaluation: Displays a percentage score and labels the AI agent or agentic workflow as Excellent, Good, Moderate, or Poor based on default thresholds, which can be customized. This rating guides deployment decisions and improvement efforts.
- Recommended actions: Suggestions are provided per evaluation result to help customers decide whether to deploy an AI agent or investigate issues for enhancement.
- Individual record scoring: Each execution record is scored against multiple metrics to assess task success and tool usage accuracy.
- Exportable reports: Results can be exported in CSV format for detailed analysis or record keeping.
Key Outcomes
- Overall task completeness ratings:
  - Excellent (90-100%): High standard task completion; safe to proceed with deployment.
  - Good (70-89%): Mostly successful tasks with some inconsistencies; deploy with caution.
  - Moderate (50-69%): Many incomplete tasks; requires investigation before deployment.
  - Poor (0-49%): Consistent failures; do not deploy.
- Individual record metrics:
  - Overall task completeness score: Ranges from 3 (successful) to 1 (unsuccessful), assessing whether all subtasks were completed in a logical sequence without critical errors.
  - Tool performance score: Indicates whether the AI agent selected the appropriate tool for each task step (1 = right tool chosen, 0 = wrong tool).
  - Tool calling records score: Validates the correctness of tool call inputs (parameter completeness, value correctness, and format correctness). A score of 1 means all conditions are met; 0 means one or more failed.
By using these evaluation results, ServiceNow customers can ensure their AI agents and workflows operate effectively, optimize tool usage, and confidently proceed with deployment or targeted improvements based on clear, actionable insights.
Learn about agentic evaluation runs and the meaning behind different evaluation scores from the agentic evaluation results page.
Agentic evaluations overview
Agentic evaluations measure how well AI agents and agentic workflows are accomplishing their objectives. A Now LLM Service model judges the AI agent or agentic workflow based on the execution logs. The results page of an evaluation run shows multiple metrics and scores measuring task completeness and tool use.
If you run an overall task completion evaluation, the results page shows recommended actions for the AI agent or agentic workflow. Recommended actions give you suggestions for deployment or improvement to help verify that the agentic workflows that you deploy are performing up to your standards.
After you've reviewed your evaluation results, you can archive your evaluation or copy it to run another evaluation with the same parameters and dataset.
You can export the evaluation results as a report. The report is formatted as a .csv file that includes the individual sys_ids of the execution records and the metric scores for each.
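As a rough illustration of working with the exported report, the following sketch parses a .csv file with Python's standard `csv` module. The column names (`execution_sys_id`, `task_completeness`, `tool_choice`, `tool_calling`) and the sample rows are assumptions made for this example; check the header row of your own export, because the actual column names may differ.

```python
import csv
import io

# Made-up sample data. The header names below are hypothetical and are not
# taken from an actual exported report.
SAMPLE_REPORT = """\
execution_sys_id,task_completeness,tool_choice,tool_calling
a1b2c3,3,1,1
d4e5f6,2,1,0
"""


def load_report(text):
    """Parse report text into a list of dicts with integer metric scores."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        rows.append({
            "execution_sys_id": row["execution_sys_id"],
            "task_completeness": int(row["task_completeness"]),
            "tool_choice": int(row["tool_choice"]),
            "tool_calling": int(row["tool_calling"]),
        })
    return rows


records = load_report(SAMPLE_REPORT)
```

To read a real export instead of the sample string, you could pass the contents of the downloaded file to `load_report` after adjusting the column names to match its header row.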
For more information on AI agent usage and other analytics, you can review the AI Agent Analytics dashboard in the AI Agent Studio.
Evaluation results overview
For each evaluation method that you execute, the results page displays an overall score for the agentic workflow with a percentage of successful record evaluations and a label of Excellent, Good, Moderate, or Poor. You can change the metric thresholds for each label by selecting Customize metric thresholds.
In addition to the overall task completeness results, you can review a summary of the results of the other metrics.
| Label | Description | Recommended action | Default threshold |
|---|---|---|---|
| Excellent | Tasks were consistently performed at a high standard. The agentic workflow or AI agent is working well. | Proceed with confidence | 90%–100% |
| Good | Most tasks were performed successfully, but some performance inconsistencies suggest areas for improvement. | Deploy with caution | 70%–89% |
| Moderate | A significant number of tasks weren't fully completed. Performance is below the desired level. | Investigate the root causes of poor task completion | 50%–69% |
| Poor | The agentic workflow is consistently failing to complete tasks adequately. Major issues are present. | Do not deploy | 0%–49% |
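The default thresholds in the table can be expressed as a small lookup. This is a minimal sketch, not the platform's implementation: the function name `label_for` and the data structures are hypothetical, and treating each lower bound as inclusive (so that a fractional value such as 89.5 maps to Good) is an assumption, because the table only lists whole-number ranges.

```python
# Default lower bounds taken from the table above, highest first.
DEFAULT_THRESHOLDS = [
    (90, "Excellent"),
    (70, "Good"),
    (50, "Moderate"),
    (0, "Poor"),
]

# Recommended actions copied from the table above.
RECOMMENDED_ACTIONS = {
    "Excellent": "Proceed with confidence",
    "Good": "Deploy with caution",
    "Moderate": "Investigate the root causes of poor task completion",
    "Poor": "Do not deploy",
}


def label_for(percent, thresholds=DEFAULT_THRESHOLDS):
    """Return the label whose lower bound is the highest one not above percent."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    for lower_bound, label in thresholds:
        if percent >= lower_bound:
            return label
    raise ValueError("thresholds must include a lower bound of 0")
```

Passing a different `thresholds` list mimics the effect of customized metric thresholds. For example, `label_for(85, [(95, "Excellent"), (80, "Good"), (60, "Moderate"), (0, "Poor")])` still returns Good, while `label_for(92, ...)` with the same list no longer reaches Excellent.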
Individual record metric scores
Evaluations are run against the log tables of agentic workflow executions. Each record is individually scored for each evaluation plan that you run. Individual record evaluations are scored according to the following metrics.
Overall task completeness score:

| Number | Score | Description |
|---|---|---|
| 3 | Successful | The main task was fully completed. All subtasks were resolved, and the steps followed a logical sequence with no critical errors. |
| 2 | Partially successful | The task was partially completed. Some subtasks remain unresolved or inefficiencies affected the process. |
| 1 | Unsuccessful | The task wasn't completed. Critical subtasks were abandoned or unresolved, or the execution failed entirely. |
Tool performance score:

| Number | Score | Description |
|---|---|---|
| 1 | True | The right tool was chosen for the action in the plan. |
| 0 | False | The right tool wasn't chosen. |
Tool calling records score:

| Number | Score | Description |
|---|---|---|
| 1 | True | Input key completeness, input value correctness, and input format correctness are all successful. |
| 0 | False | One or more of input key completeness, input value correctness, or input format correctness wasn't successful. |
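The scoring rules above can be sketched in a few lines. Both function names are hypothetical. The first function mirrors the all-or-nothing rule for tool call inputs. The second assumes that a "successful record evaluation" means an overall task completeness score of 3; that reading is an assumption of this example, not something the results page states explicitly.

```python
def tool_calling_score(keys_complete, values_correct, format_correct):
    """Return 1 only when all three input checks pass, otherwise 0."""
    return 1 if (keys_complete and values_correct and format_correct) else 0


def percent_successful(task_scores):
    """Percentage of records scored 3 (assumed meaning of 'successful')."""
    if not task_scores:
        raise ValueError("at least one record score is required")
    successful = sum(1 for score in task_scores if score == 3)
    return 100 * successful / len(task_scores)
```

Under these assumptions, a run with task completeness scores of 3, 3, 2, and 1 has two successful records out of four, which falls in the default Moderate range.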