Evaluation in the Virtual Agent's asset record

Release version: Zurich

Updated January 14, 2026

The Evaluation tab in the Virtual Agent's Asset record contains the Evaluation dashboard, which is designed to measure, automate, and improve the quality of interactions with Virtual Agent. This dashboard addresses several key challenges to enhance the end-user experience and overall virtual agent utility.

Evaluation dashboard

Prerequisites

You must enable evaluations first. For more information, see Enabling evaluations.

Conversations are excluded from auto-evaluation if any of the following conditions are met:

HR conversations: Conversations related to Human Resources are filtered out, which means that they aren’t evaluated.
Inaccessible or empty Knowledge Base (KB) articles: Conversations involving a Genius Result that points to a KB article that is either inaccessible via script or empty, such as certain restricted HR Knowledge articles.
Immediate live agent transfer: A conversation that begins immediately with transfer to a live agent, with no prior interaction with the virtual agent.
Short conversations: Conversations that have fewer than 180 words before a live agent is invoked. The word count is configurable via the autoEvalConstants script include. The assumption is that conversations below this threshold didn’t contain a meaningful interaction with the Virtual Agent.
Custom triggers: Any custom-defined exclusion triggers.
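
The exclusion rules above can be read as a single eligibility check. The following Python sketch is illustrative only: the Conversation fields, the exclusion_reason function, and the reason strings are hypothetical names and not part of the product. Only the 180-word default comes from this page, and the order in which the rules are checked is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple

# Default from this page; in the product it is configured in the
# autoEvalConstants script include. The constant name is hypothetical.
MIN_WORDS_BEFORE_LIVE_AGENT = 180


@dataclass
class Conversation:
    """Hypothetical flattened view of a conversation record."""
    is_hr: bool = False
    kb_article_inaccessible_or_empty: bool = False
    starts_with_live_agent_transfer: bool = False
    words_before_live_agent: int = 0


# A custom trigger is a (name, predicate) pair; the shape is an assumption.
Trigger = Tuple[str, Callable[[Conversation], bool]]


def exclusion_reason(
    conv: Conversation,
    min_words: int = MIN_WORDS_BEFORE_LIVE_AGENT,
    custom_triggers: Sequence[Trigger] = (),
) -> Optional[str]:
    """Return why a conversation is excluded, or None if it can be evaluated."""
    if conv.is_hr:
        return "hr_conversation"
    if conv.kb_article_inaccessible_or_empty:
        return "kb_article_inaccessible_or_empty"
    if conv.starts_with_live_agent_transfer:
        return "immediate_live_agent_transfer"
    if conv.words_before_live_agent < min_words:
        return "short_conversation"
    for name, predicate in custom_triggers:
        if predicate(conv):
            return "custom:" + name
    return None
```

A conversation is auto-evaluated only when exclusion_reason returns None. Returning the first matching reason, rather than a Boolean, makes it easier to see why a given conversation was skipped.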

Evaluation dashboard overview

The Evaluation dashboard helps in:

Establishing a reliable measurement process: Enables systematic tracking of the end-user experience with the Virtual Agent, providing deeper insight into interactions.
Automating conversation quality evaluation: Evaluates conversation quality automatically across different user interactions, which helps create a trusted, scalable metric for performance tracking.
Continuous improvement: Supports the iterative refinement of the Virtual Agent's performance, enhancing the overall user experience.
Scalable monitoring: Helps ensure that evaluating and tracking Virtual Agent quality is both efficient and scalable, so that issues and improvements are identified quickly over time.
User feedback integration: A set of optional questions enables end users to provide direct feedback on their experience, which is used to improve the quality of future interactions.
Service desk manager insights: Enables service desk managers to track and review auto-evaluation scores over time. Managers can also manually add feedback for benchmarking purposes, providing valuable insights into conversation quality and opportunities for improvement.
Sustainable evaluation process: Combines automated evaluation with manual feedback to continuously improve Virtual Agent performance, enabling a scalable and sustainable system that evolves over time.

Important:

The evaluation dashboard doesn't support domain separation.

Overview tab

The Overview tab of the Evaluation dashboard provides a comprehensive view of all metrics and evaluation data.

The following widgets are available, showing various metrics:

Average auto-evaluation score for the selected metric: Shows the average auto-evaluation score for the metric selected and its trend over time.

For more information about each metric, see .
Average Human Feedback score for the selected metric: Shows the average human-labeled score for the selected metric.
Note:
The score is available only if enough chat records have been manually evaluated. For more information about manually evaluating conversations, see Human feedback for evaluations.
Evaluation score trend: Tracks the weekly score for the selected metric.

If you turn on the View Deviation and Adjusted Scores toggle, the trend chart compares the auto-evaluated and user-defined scores by overlaying the upper and lower deviations and the final adjusted score.

Note:
The deviation and adjusted scores are calculated only if you have at least 50 human labels.

For more information about how the calculations are made, see .
Evaluations: Shows the total number of conversations that were evaluated each week.
Human feedback section: Contains detailed information about each evaluation. From here, you can manually evaluate conversations. For more information, see Human feedback for evaluations.
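
The 50-label threshold for deviation and adjusted scores can be illustrated with a small Python sketch. This page doesn't give the actual formulas, so everything below except the threshold is an assumption: deviation is taken as the largest and smallest human-minus-auto difference, and the adjusted score is the auto average shifted by the mean difference. The function and key names are hypothetical.

```python
from statistics import mean
from typing import Dict, List, Optional, Tuple

# Threshold stated on this page; the constant name is hypothetical.
MIN_HUMAN_LABELS = 50


def deviation_summary(
    pairs: List[Tuple[float, float]],
) -> Optional[Dict[str, float]]:
    """Summarize (auto_score, human_score) pairs for one metric.

    Returns None when there are too few human labels. The formulas are
    illustrative assumptions, not the product's calculation.
    """
    if len(pairs) < MIN_HUMAN_LABELS:
        return None
    diffs = [human - auto for auto, human in pairs]
    auto_average = mean(auto for auto, _ in pairs)
    return {
        "upper_deviation": max(diffs),
        "lower_deviation": min(diffs),
        # Assumed adjustment: shift the auto average by the mean difference.
        "adjusted_score": auto_average + mean(diffs),
    }
```

With fewer than 50 labeled pairs the function returns None, which mirrors the dashboard showing no deviation overlay until enough human feedback exists.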

Evaluations

Each conversation is evaluated on eight different metrics. For each of these metrics, there’s a separate skill. You can view these skills in Now Assist Skill Kit under Custom skills.

For more information about each metric, see .

Role required: sn_skill_builder.admin

Custom skills for evaluation.

The following Now Assist custom skills are used:

Chat Topic Classifier
Coherence Chat Evaluation
Conciseness Chat Eval
Context Retention
Inadequate Slot Filling Chat Eval
Intent Accuracy Chat Eval
Smooth Flowing Conversation Chat Eval
Truthfulness Hallucination Chat Eval

The default provider for these skills is Now LLM. You can change the provider to Azure OpenAI, Google Gemini, or AWS Claude. Azure OpenAI has been observed to improve results in certain scenarios.

For more information about Now Assist Skill Kit, see Now Assist Skill Kit.

Process of evaluation

Flow: Execute Evaluation.

10% of the daily conversations are sampled, and each sampled conversation is checked to determine whether it qualifies for evaluation. Transcripts are built for these conversations and then sent to the configured large language model (LLM).
For the conversations that qualify, the transcripts, along with the prompts for the different scales, are sent to the LLM, which then evaluates the conversations.
After evaluation, each conversation goes through post-processing, where the scores and the reasons for the scores that the LLM provided are parsed and stored in the Evaluation and Evaluation Metrics tables.
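
The steps above can be sketched in Python as sample, filter, build transcript, call the LLM, and store. This is a minimal sketch under stated assumptions, not the Execute Evaluation flow itself: the function names, the conversation dictionary shape, the JSON response format with score and reason keys, and the two lists standing in for the Evaluation and Evaluation Metrics tables are all hypothetical. The LLM is passed in as a plain callable so that no provider API is assumed.

```python
import json
import random
from typing import Callable, Dict, List, Optional, Tuple

SAMPLE_RATE = 0.10  # 10% of daily conversations, as stated on this page


def sample_conversations(
    conversations: List[Dict], rate: float = SAMPLE_RATE, seed: Optional[int] = None
) -> List[Dict]:
    """Pick a random subset; rounding and the minimum of one are assumptions."""
    if not conversations:
        return []
    count = max(1, round(len(conversations) * rate))
    return random.Random(seed).sample(conversations, count)


def build_transcript(conversation: Dict) -> str:
    """Flatten messages into 'speaker: text' lines (assumed format)."""
    return "\n".join(f"{m['speaker']}: {m['text']}" for m in conversation["messages"])


def run_evaluation(
    conversations: List[Dict],
    metric_prompts: Dict[str, str],
    is_eligible: Callable[[Dict], bool],
    llm: Callable[[str], str],
    seed: Optional[int] = None,
) -> Tuple[List[Dict], List[Dict]]:
    """Return (evaluation_rows, metric_rows) for the sampled conversations."""
    evaluation_rows: List[Dict] = []
    metric_rows: List[Dict] = []
    for conversation in sample_conversations(conversations, seed=seed):
        if not is_eligible(conversation):
            continue  # excluded conversations are never sent to the LLM
        transcript = build_transcript(conversation)
        evaluation_rows.append({"conversation_id": conversation["id"]})
        for metric, prompt in metric_prompts.items():
            # Post-processing: parse the score and reason from the response.
            parsed = json.loads(llm(prompt + "\n\n" + transcript))
            metric_rows.append({
                "conversation_id": conversation["id"],
                "metric": metric,
                "score": parsed["score"],
                "reason": parsed["reason"],
            })
    return evaluation_rows, metric_rows
```

Passing the eligibility check and the LLM as callables keeps the sketch testable with a stub response and independent of whichever provider is configured.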

Note:

Conversation evaluation scores are attributed to the evaluation date and not to the date the conversation was created. For example, if a chat that happened at time t is evaluated at time t+10, the scores from the evaluator are aggregated for the week of t+10 and not for the week of t.
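
The week attribution described in this note can be sketched in Python as follows. The record keys (created_on, evaluated_on, score), the function names, and the choice of Monday as the start of the week are assumptions made for illustration. The example also treats t+10 as ten days, which this page doesn't specify.

```python
from collections import defaultdict
from datetime import date, timedelta
from statistics import mean
from typing import Dict, List


def week_start(day: date) -> date:
    """Return the Monday of the week containing the given day (assumed convention)."""
    return day - timedelta(days=day.weekday())


def weekly_averages(evaluations: List[Dict]) -> Dict[date, float]:
    """Average scores per week, bucketed by evaluation date, not creation date."""
    buckets: Dict[date, List[float]] = defaultdict(list)
    for evaluation in evaluations:
        # created_on is deliberately ignored when choosing the bucket.
        buckets[week_start(evaluation["evaluated_on"])].append(evaluation["score"])
    return {week: mean(scores) for week, scores in sorted(buckets.items())}


# A chat created on January 5, 2026 and evaluated ten days later counts
# toward the week starting January 12, 2026.
example = [{"created_on": date(2026, 1, 5), "evaluated_on": date(2026, 1, 15), "score": 4.0}]
weekly = weekly_averages(example)
```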

For detailed information about the evaluation flow, see .