
Ashley Snyder
ServiceNow Employee

 

As organizations build more sophisticated AI agents and agentic workflows in ServiceNow, a critical question emerges: How do you know your agents are ready for production?

 

Traditional testing methods work well for deterministic workflows, where the same input consistently produces the same output. But AI agents are different. They're flexible and adaptive, making context-aware decisions with variable outputs. This means they require a new approach to quality validation.

 

[Image: OverviewDash.png]

 

 


Why Agentic Workflows Need Different Quality Validation

 

[Image: WorkflowSlides.png]

 

 

Let's discuss what makes agentic workflows different from traditional automation:

 

Deterministic Workflows

  • Rule-based and predictable
  • Same input → same output
  • Good for well-defined tasks
  • Traditional pass/fail testing works well

Agentic Workflows

  • Flexible and adaptive
  • Context-aware with variable outputs
  • Good for complex, undefined tasks
  • Requires validation beyond pass/fail: verifying tool selection, parameter accuracy, and genuine task completion

 

When your workflow uses AI agents to interpret user intent, choose the right tools, and orchestrate multi-step processes, you need more than unit tests. You need AI-powered agentic evaluation.

 


Where Agentic Evaluations Fits in Your Process

 

[Image: Process.png]

 

 

Agentic Evaluations serves as your quality gate between development and production, helping you validate agent readiness at scale before making them available to end users.

 

Here's how it fits with other ServiceNow AI capabilities:

 

  1. Plan & Build (AI Agent Studio)
    Create your agents, configure their tools and instructions, define workflows
  2. Manual Test (AI Agent Studio)
    Run test conversations to validate basic behavior, refine agent responses through iterative testing
  3. Automated Evaluation (Agentic Evaluations) ← You are here
    Measure task completion rates, tool calling accuracy, and overall quality at scale with automated, judge-based evaluation of execution logs
  4. Deploy
    Promote production-ready agents to live environments and make them available to end users
  5. Monitor (AI Agent Analytics)
    Track ongoing production performance, measure user satisfaction, identify areas for continuous improvement

💡 Think of Agentic Evaluations as your pre-flight checklist: After building and manually testing your agent, use automated evaluation to validate it works correctly across diverse scenarios before deployment. Once deployed, AI Agent Analytics helps you monitor ongoing performance and identify opportunities for refinement.

 


What Are Agentic Evaluations?

 

Agentic Evaluations is an automated quality validation capability that helps you assess AI agent performance before production deployment. Think of it as your pre-flight checklist for agentic workflows.

 

How it works: Agentic Evaluations uses LLM-based judges. An AI model analyzes your agent's execution logs and assesses quality, similar to how an experienced ServiceNow administrator would review agent behavior. Rather than just checking final outputs, the evaluation examines the complete sequence of decisions and actions: every tool selection, every parameter passed, every step in the workflow. The judge evaluates whether your agent achieved the right outcome, even when the path varies between executions. Each evaluation produces a score to help you understand your agent's performance.
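
To make that concrete, here is a minimal sketch in Python of what a judge reviews. The record shapes, field names, and prompt wording are illustrative assumptions, not ServiceNow's actual data model or judging prompts; the point is that the judge sees the whole sequence of tool calls, not just the final answer.

```python
# Conceptual sketch only: the record shapes, field names, and prompt wording are
# illustrative assumptions, not ServiceNow's data model or judging prompts.
from dataclasses import dataclass
from typing import List

@dataclass
class ToolCall:
    step: int
    tool_name: str    # e.g. "Get Knowledge Article"
    parameters: dict  # the inputs the agent constructed for this tool
    succeeded: bool

@dataclass
class ExecutionLog:
    user_request: str          # initial intent, e.g. "resolve incident INC0012345"
    tool_calls: List[ToolCall]
    final_outcome: str         # what the agent reported at the end

def build_judge_prompt(log: ExecutionLog) -> str:
    """Assemble what an LLM judge reviews: the full sequence of decisions,
    not just the final answer."""
    steps = "\n".join(
        f"{c.step}. {c.tool_name}({c.parameters}) -> {'ok' if c.succeeded else 'failed'}"
        for c in log.tool_calls
    )
    return (
        "You are reviewing an AI agent's execution log.\n"
        f"User request: {log.user_request}\n"
        f"Steps taken:\n{steps}\n"
        f"Reported outcome: {log.final_outcome}\n"
        "Did the agent genuinely complete the task? Reply with a 0-1 score and a short rationale."
    )

example = ExecutionLog(
    user_request="resolve incident INC0012345",
    tool_calls=[
        ToolCall(1, "Get Knowledge Article", {"topic": "Okta login failure"}, True),
        ToolCall(2, "Update Incident Record", {"state": "Resolved"}, True),
    ],
    final_outcome="Incident resolved and record updated",
)
print(build_judge_prompt(example))
```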

 

The framework provides three core evaluation metrics:

 

[Image: Metrics.png]

 

1. Overall Task Completeness

Validates actual completion: Confirms that workflows genuinely accomplished their assigned tasks, not just that agents reported success. This metric analyzes your agent's execution record, from the initial user request through each tool invocation to the final outcome, to verify real business outcomes were achieved.

 

For example, if you have an incident resolution workflow, this metric evaluates: Did the agent correctly categorize the incident, retrieve the relevant knowledge article, apply the fix, and update the incident record? It's your workflow-level view confirming that the multi-step orchestration achieved the intended business outcome.
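
Because this metric is reported across your whole evaluation dataset, you can think of the workflow-level score roughly like the sketch below. It assumes each execution log has already received a judge verdict; this aggregation is a conceptual simplification, not the product's scoring code.

```python
# Conceptual simplification, not the product's scoring code: assumes each execution
# log already has a judge verdict (True = the task was genuinely completed).
def task_completion_rate(verdicts: list) -> float:
    """Fraction of evaluated execution logs where the judge confirmed the intended
    business outcome (categorize, retrieve the article, apply the fix, update the record)."""
    return sum(verdicts) / len(verdicts) if verdicts else 0.0

verdicts = [True, True, False, True, True, True, False, True, True, True]
print(f"Overall task completeness: {task_completion_rate(verdicts):.0%}")  # 80%
```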

2. Tool Calling Correctness

Validates parameter accuracy and completeness: Confirms that agents correctly construct tool calls with accurate parameters, proper formatting, and all required fields. Even when the agent picks the right tool, this metric ensures the inputs will lead to successful execution.

 

For example, if your agent calls "Update Incident Record," this metric checks: Is the incident sys_id present and properly formatted? Are required fields like priority and state included? Are the values valid according to your ServiceNow instance configuration? This granular validation ensures that tool selection translates into successful tool execution.
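
The sketch below illustrates the kinds of checks this metric covers for the "Update Incident Record" example, written as hand-rolled rules. The real evaluation is judge-based and aware of your instance configuration; the required-field set and state values here are assumptions for illustration.

```python
import re

# Illustrative only: these rules mirror the *kinds* of validations described above
# (sys_id format, required fields, valid values). The real evaluation is judge-based
# and instance-aware, not this hand-rolled rule set.
SYS_ID_PATTERN = re.compile(r"^[0-9a-f]{32}$")           # sys_id is a 32-character identifier
REQUIRED_FIELDS = {"sys_id", "priority", "state"}        # assumed required inputs
VALID_STATES = {"New", "In Progress", "On Hold", "Resolved", "Closed"}  # assumed choice list

def check_update_incident_call(params: dict) -> list:
    """Return a list of problems found in an 'Update Incident Record' tool call."""
    problems = []
    missing = REQUIRED_FIELDS - params.keys()
    if missing:
        problems.append(f"missing required fields: {sorted(missing)}")
    sys_id = params.get("sys_id", "")
    if sys_id and not SYS_ID_PATTERN.match(sys_id):
        problems.append("sys_id is not a well-formed 32-character identifier")
    if params.get("state") not in VALID_STATES:
        problems.append(f"state {params.get('state')!r} is not a valid choice")
    return problems

call = {"sys_id": "46d44a5dc0a8010e0000b3a5b6c8f123", "priority": "2", "state": "Resolved"}
print(check_update_incident_call(call) or "tool call looks correct")
```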

3. Tool Choice Accuracy

Validates optimal tool selection: Confirms that at each decision point in the workflow, your agent selected the most appropriate tool for the task at hand. The judge analyzes the context, available tools, and task requirements to verify intelligent decision-making throughout the execution.

 

For example, when resolving a service request, did the agent correctly choose the "Get Knowledge Article" tool when it needed information, or did it mistakenly try to update a record before gathering necessary details? This validates decision-making quality before choices cascade into downstream issues.
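
As a rough illustration of the "update before gathering details" example, the sketch below flags that one ordering mistake. The actual metric relies on an LLM judge weighing context, available tools, and task requirements; the tool names and the single heuristic here are assumptions.

```python
# Illustrative heuristic only: the real metric uses an LLM judge that weighs context,
# available tools, and task requirements. This sketch checks just one pattern from
# the example above: did the agent gather information before updating a record?
READ_TOOLS = {"Get Knowledge Article", "Look Up Incident"}   # assumed tool names
WRITE_TOOLS = {"Update Incident Record"}                     # assumed tool names

def updated_before_reading(tool_sequence: list) -> bool:
    """True if a record-updating tool was called before any information-gathering tool."""
    for tool in tool_sequence:
        if tool in WRITE_TOOLS:
            return True    # the agent wrote before it read
        if tool in READ_TOOLS:
            return False   # information was gathered first
    return False

good = ["Get Knowledge Article", "Update Incident Record"]
bad = ["Update Incident Record", "Get Knowledge Article"]
print(updated_before_reading(good))  # False: correct ordering
print(updated_before_reading(bad))   # True: flags a questionable tool choice
```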

 

Additional Capabilities:

Beyond the core evaluation metrics, these capabilities help you integrate Agentic Evaluations into your development and deployment process:

  • Run an agentic workflow to generate execution logs: Automatically evaluate your agentic workflows or agents at scale across multiple records using an LLM as a conversational partner to drive workflow and agent execution. See "What's New in Q4 2025" for details.
  • Customizable quality thresholds: Different agents have different quality requirements. Use the "Customize metric thresholds" button to set appropriate standards: a customer-facing incident resolution agent might need 90%+ task completion to be considered "Good," while an internal categorization agent might accept 75%. Adjust thresholds to match your specific risk tolerance and business impact. A conceptual sketch of this kind of threshold mapping follows this list.

 

[Image: SetMetrics.png]

 

  • Export for stakeholder approval: Before production deployment, you'll need buy-in from multiple stakeholders. The "Export as report" feature generates comprehensive documentation showing evaluation scores, specific failures, and improvement recommendations. Share these reports with compliance teams, service owners, and leadership to inform deployment decisions and create audit documentation.
  • Clone for regression evaluation: As you refine your agent, you want to ensure changes don't break existing functionality. Clone your baseline evaluation to create an identical evaluation configuration, then re-run it after each change. This gives you an objective before/after comparison showing whether your modifications improved or degraded performance.
  • Monitor large evaluation runs: When evaluating against 100+ records, evaluations can take 10-30 minutes to complete. Real-time progress tracking shows you exactly how far along the evaluation is, helping you plan your next iteration, grab coffee while waiting, or start reviewing partial results as they become available.
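
To picture how customizable thresholds could translate scores into rating badges, here is a conceptual sketch. The badge names echo the ones shown in the product, but the per-agent numbers and the mapping logic are illustrative assumptions, not the product's configuration format.

```python
# Conceptual sketch of customizable thresholds: the badge names match the ones shown
# in the product, but the per-agent numbers and mapping logic are assumptions.
DEFAULT_THRESHOLDS = {"Excellent": 0.95, "Good": 0.85, "Moderate": 0.70}

# A customer-facing agent can demand a stricter bar than an internal one.
AGENT_THRESHOLDS = {
    "incident_resolution_agent": {"Excellent": 0.97, "Good": 0.90, "Moderate": 0.80},
    "internal_categorization_agent": {"Excellent": 0.90, "Good": 0.75, "Moderate": 0.60},
}

def rating_for(agent: str, score: float) -> str:
    """Map a metric score to a rating badge using the agent's own thresholds."""
    thresholds = AGENT_THRESHOLDS.get(agent, DEFAULT_THRESHOLDS)
    for badge, minimum in sorted(thresholds.items(), key=lambda kv: kv[1], reverse=True):
        if score >= minimum:
            return badge
    return "Needs improvement"

print(rating_for("incident_resolution_agent", 0.91))       # Good
print(rating_for("internal_categorization_agent", 0.78))   # Good
```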

 


How It Works: The Three-Step Journey

 

1. Configure

Pick an agentic workflow or standalone AI agent to evaluate, choose your evaluation metrics (we recommend starting with all three), and select your evaluation data source.

 

You have three options for evaluation data:

 

Option 1: Use existing execution logs from your instance

Leverage logs from agentic executions that have already run in your environment. This is ideal when you want to evaluate real-world performance, conduct post-deployment analysis, or validate agents that already have production traffic. Our recommendation is to clone execution log data from production to a sub-production environment to use this option.

Option 2: Generate new execution logs through manual testing

Run your agentic workflow or agent in AI Agent Studio to create fresh execution log data with full control over specific scenarios. This works well when you need precise control over test cases or are in active development.

Option 3: Run an agentic workflow to generate execution logs (New!)

Rather than waiting for real users or manually creating records one at a time, you can now automatically evaluate your agents at scale across multiple records. An LLM acts as a conversational partner with your agent, autonomously driving workflow execution to generate execution logs for evaluation.

 

 

2. Evaluate

Run the automated evaluation (typically completes in minutes). Track progress in real-time while AI-powered judges assess your agent's performance across the metrics you selected.

 

The evaluation system analyzes each execution, whether from real usage, manual testing, or automated agent execution, examining the complete sequence of decisions: every tool selection, every parameter passed, every step in the workflow. You get consistent, objective scoring at scale, regardless of your data source.

 

3. Optimize

Review the evaluation dashboard and detailed results to identify patterns in your agentic workflows and agent performance. Drill down into individual execution records to understand what needs improvement, refine your agents based on what you learn, then re-evaluate to measure progress.

 

With the ability to run an agentic workflow to generate execution logs, this optimization cycle becomes faster. Instead of making a change and then manually re-running test conversations, you can automatically evaluate against multiple records and iterate more quickly.

 


Reading Your Results

 

[Image: Dashboard2.png]

 

 

After running an evaluation, you'll see AI-powered scoring with clear deployment guidance. Each metric receives a rating like "Good," "Excellent," or "Deploy with caution," giving you visibility into your agentic workflow or agent's readiness. You can then drill down into individual execution records to understand what needs improvement.

 

  • Good: Most tasks completed successfully, but some performance inconsistencies suggest areas for improvement
  • Excellent: The agent consistently selects the best tools for each task
  • Moderate: A significant portion of tool inputs had correctness issues

Each metric includes:

  • A clear rating badge (Good, Excellent, Moderate, etc.)
  • Specific percentages showing how execution logs performed across your dataset
  • Result explanations describing what the evaluation observed—which steps succeeded, which failed, and common patterns
  • Recommended actions providing specific guidance on what to investigate (e.g., "Investigate recurring failures, identify root causes, and refine tool execution logic")

 

You can drill down into individual execution logs to see the complete workflow: exactly which tools the agent invoked, what parameters were passed to each tool, which steps succeeded or failed, and how the workflow progressed from start to finish. This visibility helps you identify patterns across multiple executions rather than debugging individual cases.

 

[Image: ExecutionResult.png]

 

 


Best Practices for Effective Evaluation

 

1. Start with Enough Data

  • 10-50 samples: Initial testing and debugging
  • 100-300 samples: Production rollout readiness

Start small to validate your approach, then expand as you gain confidence.

2. Measure Value by Task Completion

Success means the agent correctly completes a complex workflow that delivers business value: faster resolution, accurate record updates, proper escalation. Don't just measure whether the chat ended.

3. Mix Automated and Human Evaluation

Use automated evaluation runs for scale and consistency, but dedicate human time to manually validate sensitive, ambiguous, and high-risk outputs. The automated scores are AI-generated and probabilistic, so human review is essential.

4. Fix Patterns, Not Individual Failures

When you see failures, look for common issues across multiple executions. Fixing one case at a time can create new problems elsewhere. Identify systematic issues with agent instructions, tool descriptions, or workflow design.

 


What's New in Q4 2025

 

This release brings two significant enhancements that fundamentally transform how you can evaluate and validate AI agents:

 

Run an agentic workflow to generate execution logs

 

The Q4 release introduces the option to run an agentic workflow or standalone agent to create execution logs, a capability that uses an LLM as a conversational partner to automatically run your agentic workflows or agents against multiple records at scale.

[Image: Picture1.png]

 

 

You provide the inputs: the agentic workflow to evaluate, a starting phrase like "resolve incident {{incident.number}}", and business context about your environment. The LLM uses this information to simulate realistic user dialogue, engaging your agent in conversation and autonomously guiding the workflow toward completion. As your agent responds and executes actions, the system captures complete execution logs showing every tool call, every parameter, and every decision made.
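
Conceptually, the loop looks something like the sketch below. The simulate_user_turn and run_agent_turn functions are hypothetical stand-ins for the LLM conversational partner and the agent runtime; only the overall flow (seed with a starting phrase, alternate turns, capture everything as an execution log) reflects the description above.

```python
# Conceptual sketch of the idea, not the product's implementation: an LLM plays the
# user, your agent executes the workflow, and every exchange is captured as an
# execution log. simulate_user_turn() and run_agent_turn() are hypothetical stand-ins.
STARTING_PHRASE = "resolve incident {number}"
BUSINESS_CONTEXT = (
    "Okta is our main web authentication application. Any user affected by an Okta "
    "issue cannot log in to company web applications. Password resets should be "
    "handled within 2 hours during business hours."
)

def simulate_user_turn(context: str, transcript: list) -> str:
    # Stand-in for an LLM call that continues the conversation realistically.
    return "Yes, please apply the documented fix and resolve the incident."

def run_agent_turn(user_message: str, transcript: list) -> tuple:
    # Stand-in for the agent runtime; returns (agent_reply, workflow_finished).
    if len(transcript) == 1:
        return "I found a matching knowledge article. Should I apply the fix and resolve?", False
    return "Done. The incident record has been updated and resolved.", True

def generate_execution_log(incident_number: str, max_turns: int = 6) -> list:
    transcript = [f"user: {STARTING_PHRASE.format(number=incident_number)}"]
    for _ in range(max_turns):
        reply, finished = run_agent_turn(transcript[-1], transcript)
        transcript.append(f"agent: {reply}")
        if finished:
            break
        transcript.append(f"user: {simulate_user_turn(BUSINESS_CONTEXT, transcript)}")
    return transcript

# Drive the agent across many records instead of typing conversations by hand.
for number in ["INC0010001", "INC0010002"]:
    print(generate_execution_log(number))
```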

 

Think of it as: The LLM plays the role of a user having a conversation with your agent, asking questions, providing information, and responding to prompts. Your agent executes real workflows using real tools against actual records in your instance. Instead of you manually testing 10-20 records by typing conversations yourself, the LLM can automatically drive your agent through 50-100+ agentic workflow or agent executions.

 

How it works:

Rather than manually triggering your workflow for each execution log, you configure this capability by:

 

  • Selecting records to evaluate: Filter to specific records (e.g., incidents where State = New) and set the maximum number to evaluate
  • Providing a starting phrase: Define the conversation opener (e.g., "resolve incident")
  • Adding business context: Provide 4-5 sentences of background information to help the LLM engage realistically with your agent. For example: "Okta is our main web authentication application. Any user affected by an Okta issue will be unable to log in to any company web applications, including our CRM and HR portal. Password reset requests should be handled within 2 hours during business hours."
  • Previewing your selection: Review which records match your filter
  • Executing automatically: The system runs your workflow against each record with the LLM acting as the conversational partner

 

Why this matters:

Manual log generation creates three bottlenecks: you can't evaluate until real users generate traffic, manually creating multiple records is time-consuming, and iteration cycles are slow. The ability to run an agentic workflow to generate execution logs eliminates these bottlenecks by letting you automatically evaluate against 50-100+ records covering different data variations before deployment.

 

Important: This capability executes real workflows and modifies real records. Always use in sub-production or demo environments only, never in production.

 

 

Evaluate Standalone AI Agents

 

Agentic Evaluations now supports evaluating individual AI agents before you include them in your workflows. This lets you validate each agent works correctly on its own before combining multiple agents into complete business processes.

 

[Image: Standalone.png]

 

 

Why evaluate agents individually?

When you build agentic workflows, you typically combine several specialized agents: one agent might categorize incidents, another retrieves knowledge articles, and a third updates records. If the complete workflow isn't performing well, it's hard to know which specific agent is causing issues.

Evaluating agents individually helps you:

 

  • Validate before combining: Confirm each agent performs its specific task correctly before you add it to a multi-step workflow
  • Identify problem areas quickly: When a workflow evaluation shows poor results, evaluate each agent separately to pinpoint exactly which one needs improvement
  • Measure improvement accurately: After refining an agent's instructions or adding new capabilities, re-evaluate just that agent to see if your changes worked, without the noise from other agents in the workflow
  • Build reusable agents with confidence: If you plan to use the same agent across multiple workflows (like a categorization agent used in both IT and HR processes), validate it works reliably before deploying it everywhere

 

When to use each approach:

 

Evaluate individual agents when:

  • You're building a new agent and want to confirm it works before adding it to a workflow
  • A complete workflow isn't performing well and you need to find out which specific agent is the problem
  • You've updated an agent and want to verify the changes improved its performance
  • You're creating an agent that will be used in multiple different workflows

Evaluate complete workflows when:

  • You need to confirm multiple agents work together correctly to achieve a business outcome
  • You're preparing to deploy to production and need end-to-end validation
  • Success depends on agents passing information correctly from one step to the next
  • You want to validate the complete process, not just individual pieces