
Ashley Snyder
ServiceNow Employee

 

As organizations build more sophisticated AI agents and agentic workflows in ServiceNow, a critical question emerges: How do you know your agents are ready for production?

 

Traditional testing methods work well for deterministic workflows, where the same input consistently produces the same output. But AI agents are different. They're flexible and adaptive, making context-aware decisions with variable outputs. This means they require a new approach to quality validation.

 

[Image: OverviewDash.png]

 

 


Why Agentic Workflows Need Different Quality Validation

 

[Image: WorkflowSlides.png]

 

 

Let's discuss what makes agentic workflows different from traditional automation:

 

Deterministic Workflows

  • Rule-based and predictable
  • Same input → same output
  • Good for well-defined tasks
  • Traditional pass/fail testing works well

Agentic Workflows

  • Flexible and adaptive
  • Context-aware with variable outputs
  • Good for complex, undefined tasks
  • Requires validation beyond pass/fail: verifying tool selection, parameter accuracy, and genuine task completion

 

When your workflow uses AI agents to interpret user intent, choose the right tools, and orchestrate multi-step processes, you need more than unit tests. You need AI-powered agentic evaluation.

 


Where Agentic Evaluations Fits in Your Process

 

[Image: Process.png]

 

 

Agentic Evaluations serves as your quality gate between development and production, helping you validate agent readiness at scale before making them available to end users.

 

Here's how it fits with other ServiceNow AI capabilities:

 

  1. Plan & Build (AI Agent Studio)
    Create your agents, configure their tools and instructions, define workflows
  2. Manual Test (AI Agent Studio)
    Run test conversations to validate basic behavior, refine agent responses through iterative testing
  3. Automated Evaluation (Agentic Evaluations) ← You are here
    Measure task completion rates, tool calling accuracy, and overall quality at scale with automated, judge-based evaluation of execution logs
  4. Deploy
    Promote production-ready agents to live environments and make them available to end users
  5. Monitor (AI Agent Analytics)
    Track ongoing production performance, measure user satisfaction, identify areas for continuous improvement

💡 Think of Agentic Evaluations as your pre-flight checklist: After building and manually testing your agent, use automated evaluation to validate it works correctly across diverse scenarios before deployment. Once deployed, AI Agent Analytics helps you monitor ongoing performance and identify opportunities for refinement.

 


What Are Agentic Evaluations?

 

Agentic Evaluations is an automated quality validation capability that helps you assess AI agent performance before production deployment. Think of it as your pre-flight checklist for agentic workflows.

 

How it works: Agentic Evaluations uses LLM-based judges. An AI model analyzes your agent's execution logs and assesses quality, similar to how an experienced ServiceNow administrator would review agent behavior. Rather than just checking final outputs, the evaluation examines the complete sequence of decisions and actions: every tool selection, every parameter passed, every step in the workflow. The judge evaluates whether your agent achieved the right outcome, even when the path varies between executions. Each evaluation produces a score to help you understand your agent's performance.
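
To make that concrete, here is a minimal sketch in Python of what a judge reviews. The record shapes, field names, and prompt wording are illustrative assumptions, not ServiceNow's actual data model or judging prompts; the point is that the judge sees the whole sequence of tool calls, not just the final answer.

```python
# Conceptual sketch only: the record shapes, field names, and prompt wording are
# illustrative assumptions, not ServiceNow's data model or judging prompts.
from dataclasses import dataclass
from typing import List

@dataclass
class ToolCall:
    step: int
    tool_name: str    # e.g. "Get Knowledge Article"
    parameters: dict  # the inputs the agent constructed for this tool
    succeeded: bool

@dataclass
class ExecutionLog:
    user_request: str          # initial intent, e.g. "resolve incident INC0012345"
    tool_calls: List[ToolCall]
    final_outcome: str         # what the agent reported at the end

def build_judge_prompt(log: ExecutionLog) -> str:
    """Assemble what an LLM judge reviews: the full sequence of decisions,
    not just the final answer."""
    steps = "\n".join(
        f"{c.step}. {c.tool_name}({c.parameters}) -> {'ok' if c.succeeded else 'failed'}"
        for c in log.tool_calls
    )
    return (
        "You are reviewing an AI agent's execution log.\n"
        f"User request: {log.user_request}\n"
        f"Steps taken:\n{steps}\n"
        f"Reported outcome: {log.final_outcome}\n"
        "Did the agent genuinely complete the task? Reply with a 0-1 score and a short rationale."
    )

example = ExecutionLog(
    user_request="resolve incident INC0012345",
    tool_calls=[
        ToolCall(1, "Get Knowledge Article", {"topic": "Okta login failure"}, True),
        ToolCall(2, "Update Incident Record", {"state": "Resolved"}, True),
    ],
    final_outcome="Incident resolved and record updated",
)
print(build_judge_prompt(example))
```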

 

The framework provides three core evaluation metrics:

 

[Image: Metrics.png]

 

1. Overall Task Completeness

Validates actual completion: Confirms that workflows genuinely accomplished their assigned tasks, not just that agents reported success. This metric analyzes your agent's execution record, from the initial user request through each tool invocation to the final outcome, to verify real business outcomes were achieved.

 

For example, if you have an incident resolution workflow, this metric evaluates: Did the agent correctly categorize the incident, retrieve the relevant knowledge article, apply the fix, and update the incident record? It's your workflow-level view confirming that the multi-step orchestration achieved the intended business outcome.
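
Because this metric is reported across your whole evaluation dataset, you can think of the workflow-level score roughly like the sketch below. It assumes each execution log has already received a judge verdict; this aggregation is a conceptual simplification, not the product's scoring code.

```python
# Conceptual simplification, not the product's scoring code: assumes each execution
# log already has a judge verdict (True = the task was genuinely completed).
def task_completion_rate(verdicts: list) -> float:
    """Fraction of evaluated execution logs where the judge confirmed the intended
    business outcome (categorize, retrieve the article, apply the fix, update the record)."""
    return sum(verdicts) / len(verdicts) if verdicts else 0.0

verdicts = [True, True, False, True, True, True, False, True, True, True]
print(f"Overall task completeness: {task_completion_rate(verdicts):.0%}")  # 80%
```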

2. Tool Calling Correctness

Validates parameter accuracy and completeness: Confirms that agents correctly construct tool calls with accurate parameters, proper formatting, and all required fields. Even when the agent picks the right tool, this metric ensures the inputs will lead to successful execution.

 

For example, if your agent calls "Update Incident Record," this metric checks: Is the incident sys_id present and properly formatted? Are required fields like priority and state included? Are the values valid according to your ServiceNow instance configuration? This granular validation ensures that tool selection translates into successful tool execution.
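
The sketch below illustrates the kinds of checks this metric covers for the "Update Incident Record" example, written as hand-rolled rules. The real evaluation is judge-based and aware of your instance configuration; the required-field set and state values here are assumptions for illustration.

```python
import re

# Illustrative only: these rules mirror the *kinds* of validations described above
# (sys_id format, required fields, valid values). The real evaluation is judge-based
# and instance-aware, not this hand-rolled rule set.
SYS_ID_PATTERN = re.compile(r"^[0-9a-f]{32}$")           # sys_id is a 32-character identifier
REQUIRED_FIELDS = {"sys_id", "priority", "state"}        # assumed required inputs
VALID_STATES = {"New", "In Progress", "On Hold", "Resolved", "Closed"}  # assumed choice list

def check_update_incident_call(params: dict) -> list:
    """Return a list of problems found in an 'Update Incident Record' tool call."""
    problems = []
    missing = REQUIRED_FIELDS - params.keys()
    if missing:
        problems.append(f"missing required fields: {sorted(missing)}")
    sys_id = params.get("sys_id", "")
    if sys_id and not SYS_ID_PATTERN.match(sys_id):
        problems.append("sys_id is not a well-formed 32-character identifier")
    if params.get("state") not in VALID_STATES:
        problems.append(f"state {params.get('state')!r} is not a valid choice")
    return problems

call = {"sys_id": "46d44a5dc0a8010e0000b3a5b6c8f123", "priority": "2", "state": "Resolved"}
print(check_update_incident_call(call) or "tool call looks correct")
```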

3. Tool Choice Accuracy

Validates optimal tool selection: Confirms that at each decision point in the workflow, your agent selected the most appropriate tool for the task at hand. The judge analyzes the context, available tools, and task requirements to verify intelligent decision-making throughout the execution.

 

For example, when resolving a service request, did the agent correctly choose the "Get Knowledge Article" tool when it needed information, or did it mistakenly try to update a record before gathering necessary details? This validates decision-making quality before choices cascade into downstream issues.
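
As a rough illustration of the "update before gathering details" example, the sketch below flags that one ordering mistake. The actual metric relies on an LLM judge weighing context, available tools, and task requirements; the tool names and the single heuristic here are assumptions.

```python
# Illustrative heuristic only: the real metric uses an LLM judge that weighs context,
# available tools, and task requirements. This sketch checks just one pattern from
# the example above: did the agent gather information before updating a record?
READ_TOOLS = {"Get Knowledge Article", "Look Up Incident"}   # assumed tool names
WRITE_TOOLS = {"Update Incident Record"}                     # assumed tool names

def updated_before_reading(tool_sequence: list) -> bool:
    """True if a record-updating tool was called before any information-gathering tool."""
    for tool in tool_sequence:
        if tool in WRITE_TOOLS:
            return True    # the agent wrote before it read
        if tool in READ_TOOLS:
            return False   # information was gathered first
    return False

good = ["Get Knowledge Article", "Update Incident Record"]
bad = ["Update Incident Record", "Get Knowledge Article"]
print(updated_before_reading(good))  # False: correct ordering
print(updated_before_reading(bad))   # True: flags a questionable tool choice
```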

 

Additional Capabilities:

Beyond the core evaluation metrics, these capabilities help you integrate Agentic Evaluations into your development and deployment process:

  • Run an agentic workflow to generate execution logs: Automatically evaluate your agentic workflows or agents at scale across multiple records using an LLM as a conversational partner to drive workflow and agent execution. See "What's New in Q4 2025" for details.
  • Customizable quality thresholds: Different agents have different quality requirements. Use the "Customize metric thresholds" button to set appropriate standards: a customer-facing incident resolution agent might need 90%+ task completion to be considered "Good," while an internal categorization agent might accept 75%. Adjust thresholds to match your specific risk tolerance and business impact. A conceptual sketch of this kind of threshold mapping follows this list.

 

[Image: SetMetrics.png]

 

  • Export for stakeholder approval: Before production deployment, you'll need buy-in from multiple stakeholders. The "Export as report" feature generates comprehensive documentation showing evaluation scores, specific failures, and improvement recommendations. Share these reports with compliance teams, service owners, and leadership to inform deployment decisions and create audit documentation.
  • Clone for regression evaluation: As you refine your agent, you want to ensure changes don't break existing functionality. Clone your baseline evaluation to create an identical evaluation configuration, then re-run it after each change. This gives you an objective before/after comparison showing whether your modifications improved or degraded performance.
  • Monitor large evaluation runs: When evaluating against 100+ records, evaluations can take 10-30 minutes to complete. Real-time progress tracking shows you exactly how far along the evaluation is, helping you plan your next iteration, grab coffee while waiting, or start reviewing partial results as they become available.
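
To picture how customizable thresholds could translate scores into rating badges, here is a conceptual sketch. The badge names echo the ones shown in the product, but the per-agent numbers and the mapping logic are illustrative assumptions, not the product's configuration format.

```python
# Conceptual sketch of customizable thresholds: the badge names match the ones shown
# in the product, but the per-agent numbers and mapping logic are assumptions.
DEFAULT_THRESHOLDS = {"Excellent": 0.95, "Good": 0.85, "Moderate": 0.70}

# A customer-facing agent can demand a stricter bar than an internal one.
AGENT_THRESHOLDS = {
    "incident_resolution_agent": {"Excellent": 0.97, "Good": 0.90, "Moderate": 0.80},
    "internal_categorization_agent": {"Excellent": 0.90, "Good": 0.75, "Moderate": 0.60},
}

def rating_for(agent: str, score: float) -> str:
    """Map a metric score to a rating badge using the agent's own thresholds."""
    thresholds = AGENT_THRESHOLDS.get(agent, DEFAULT_THRESHOLDS)
    for badge, minimum in sorted(thresholds.items(), key=lambda kv: kv[1], reverse=True):
        if score >= minimum:
            return badge
    return "Needs improvement"

print(rating_for("incident_resolution_agent", 0.91))       # Good
print(rating_for("internal_categorization_agent", 0.78))   # Good
```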

 


How It Works: The Three-Step Journey

 

1. Configure

Pick an agentic workflow or standalone AI agent to evaluate, choose your evaluation metrics (we recommend starting with all three), and select your evaluation data source.

 

You have three options for evaluation data:

 

Option 1: Use existing execution logs from your instance

Leverage logs from agentic executions that have already run in your environment. This is ideal when you want to evaluate real-world performance, conduct post-deployment analysis, or validate agents that already have production traffic. Our recommendation is to clone execution log data from production to a sub-production environment to use this option.

Option 2: Generate new execution logs through manual testing

Run your agentic workflow or agent in AI Agent Studio to create fresh execution log data with full control over specific scenarios. This works well when you need precise control over test cases or are in active development.

Option 3: Run an agentic workflow to generate execution logs (New!)

Rather than waiting for real users or manually creating records one at a time, you can now automatically evaluate your agents at scale across multiple records. An LLM acts as a conversational partner with your agent, autonomously driving workflow execution to generate execution logs for evaluation.

 

 

2. Evaluate

Run the automated evaluation (typically completes in minutes). Track progress in real-time while AI-powered judges assess your agent's performance across the metrics you selected.

 

The evaluation system analyzes each execution, whether from real usage, manual testing, or automated agent execution, examining the complete sequence of decisions: every tool selection, every parameter passed, every step in the workflow. You get consistent, objective scoring at scale, regardless of your data source.

 

3. Optimize

Review the evaluation dashboard and detailed results to identify patterns in your agentic workflows and agent performance. Drill down into individual execution records to understand what needs improvement, refine your agents based on what you learn, then re-evaluate to measure progress.

 

With the ability to run an agentic workflow to generate execution logs, this optimization cycle becomes faster. Instead of making a change and then manually re-running test conversations, you can automatically evaluate against multiple records and iterate more quickly.

 


Reading Your Results

 

[Image: Dashboard2.png]

 

 

After running an evaluation, you'll see AI-powered scoring with clear deployment guidance. Each metric receives a rating like "Good," "Excellent," or "Deploy with caution," giving you visibility into your agentic workflow or agent's readiness. You can then drill down into individual execution records to understand what needs improvement.

 

  • Good: Most tasks completed successfully, but some performance inconsistencies suggest areas for improvement
  • Excellent: The agent consistently selects the best tools for each task
  • Moderate: A significant portion of tool inputs had correctness issues

Each metric includes:

  • A clear rating badge (Good, Excellent, Moderate, etc.)
  • Specific percentages showing how execution logs performed across your dataset
  • Result explanations describing what the evaluation observed—which steps succeeded, which failed, and common patterns
  • Recommended actions providing specific guidance on what to investigate (e.g., "Investigate recurring failures, identify root causes, and refine tool execution logic")

 

You can drill down into individual execution logs to see the complete workflow: exactly which tools the agent invoked, what parameters were passed to each tool, which steps succeeded or failed, and how the workflow progressed from start to finish. This visibility helps you identify patterns across multiple executions rather than debugging individual cases.

 

[Image: ExecutionResult.png]

 

 


Best Practices for Effective Evaluation

 

1. Start with Enough Data

  • 10-50 samples: Initial testing and debugging
  • 100-300 samples: Production rollout readiness

Start small to validate your approach, then expand as you gain confidence.

2. Measure Value by Task Completion

Success means the agent correctly completes a complex workflow that delivers business value: faster resolution, accurate record updates, proper escalation. Don't just measure whether the chat ended.

3. Mix Automated and Human Evaluation

Use automated evaluation runs for scale and consistency, but dedicate human time to manually validate sensitive, ambiguous, and high-risk outputs. The automated scores are AI-generated and probabilistic, so human review is essential.

4. Fix Patterns, Not Individual Failures

When you see failures, look for common issues across multiple executions. Fixing one case at a time can create new problems elsewhere. Identify systematic issues with agent instructions, tool descriptions, or workflow design.

 


What's New in Q4 2025

 

This release brings two significant enhancements that fundamentally transform how you can evaluate and validate AI agents:

 

Run an agentic workflow to generate execution logs

 

The Q4 release introduces the option to run an agentic workflow or standalone agent to create execution logs, a capability that uses an LLM as a conversational partner to automatically run your agentic workflows or agents against multiple records at scale.

[Image: Picture1.png]

 

 

You provide the inputs: the agentic workflow to evaluate, a starting phrase like "resolve incident {{incident.number}}", and business context about your environment. The LLM uses this information to simulate realistic user dialogue, engaging your agent in conversation and autonomously guiding the workflow toward completion. As your agent responds and executes actions, the system captures complete execution logs showing every tool call, every parameter, and every decision made.
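
Conceptually, the loop looks something like the sketch below. The simulate_user_turn and run_agent_turn functions are hypothetical stand-ins for the LLM conversational partner and the agent runtime; only the overall flow (seed with a starting phrase, alternate turns, capture everything as an execution log) reflects the description above.

```python
# Conceptual sketch of the idea, not the product's implementation: an LLM plays the
# user, your agent executes the workflow, and every exchange is captured as an
# execution log. simulate_user_turn() and run_agent_turn() are hypothetical stand-ins.
STARTING_PHRASE = "resolve incident {number}"
BUSINESS_CONTEXT = (
    "Okta is our main web authentication application. Any user affected by an Okta "
    "issue cannot log in to company web applications. Password resets should be "
    "handled within 2 hours during business hours."
)

def simulate_user_turn(context: str, transcript: list) -> str:
    # Stand-in for an LLM call that continues the conversation realistically.
    return "Yes, please apply the documented fix and resolve the incident."

def run_agent_turn(user_message: str, transcript: list) -> tuple:
    # Stand-in for the agent runtime; returns (agent_reply, workflow_finished).
    if len(transcript) == 1:
        return "I found a matching knowledge article. Should I apply the fix and resolve?", False
    return "Done. The incident record has been updated and resolved.", True

def generate_execution_log(incident_number: str, max_turns: int = 6) -> list:
    transcript = [f"user: {STARTING_PHRASE.format(number=incident_number)}"]
    for _ in range(max_turns):
        reply, finished = run_agent_turn(transcript[-1], transcript)
        transcript.append(f"agent: {reply}")
        if finished:
            break
        transcript.append(f"user: {simulate_user_turn(BUSINESS_CONTEXT, transcript)}")
    return transcript

# Drive the agent across many records instead of typing conversations by hand.
for number in ["INC0010001", "INC0010002"]:
    print(generate_execution_log(number))
```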

 

Think of it as: The LLM plays the role of a user having a conversation with your agent, asking questions, providing information, and responding to prompts. Your agent executes real workflows using real tools against actual records in your instance. Instead of you manually testing 10-20 records by typing conversations yourself, the LLM can automatically drive your agent through 50-100+ agentic workflow or agent executions.

 

How it works:

Rather than manually triggering your workflow for each execution log, you configure this capability by:

 

  • Selecting records to evaluate: Filter to specific records (e.g., incidents where State = New) and set the maximum number to evaluate
  • Providing a starting phrase: Define the conversation opener (e.g., "resolve incident")
  • Adding business context: Provide 4-5 sentences of background information to help the LLM engage realistically with your agent. For example: "Okta is our main web authentication application. Any user affected by an Okta issue will be unable to log in to any company web applications, including our CRM and HR portal. Password reset requests should be handled within 2 hours during business hours."
  • Previewing your selection: Review which records match your filter
  • Executing automatically: The system runs your workflow against each record with the LLM acting as the conversational partner

 

Why this matters:

Manual log generation creates three bottlenecks: you can't evaluate until real users generate traffic, manually creating multiple records is time-consuming, and iteration cycles are slow. The ability to run an agentic workflow to generate execution logs eliminates these bottlenecks by letting you automatically evaluate against 50-100+ records covering different data variations before deployment.

 

Important: This capability executes real workflows and modifies real records. Always use in sub-production or demo environments only, never in production.

 

 

Evaluate Standalone AI Agents

 

Agentic Evaluations now supports evaluating individual AI agents before you include them in your workflows. This lets you validate each agent works correctly on its own before combining multiple agents into complete business processes.

 

[Image: Standalone.png]

 

 

Why evaluate agents individually?

When you build agentic workflows, you typically combine several specialized agents: one agent might categorize incidents, another retrieves knowledge articles, and a third updates records. If the complete workflow isn't performing well, it's hard to know which specific agent is causing issues.

Evaluating agents individually helps you:

 

  • Validate before combining: Confirm each agent performs its specific task correctly before you add it to a multi-step workflow
  • Identify problem areas quickly: When a workflow evaluation shows poor results, evaluate each agent separately to pinpoint exactly which one needs improvement
  • Measure improvement accurately: After refining an agent's instructions or adding new capabilities, re-evaluate just that agent to see if your changes worked, without the noise from other agents in the workflow
  • Build reusable agents with confidence: If you plan to use the same agent across multiple workflows (like a categorization agent used in both IT and HR processes), validate it works reliably before deploying it everywhere

 

When to use each approach:

 

Evaluate individual agents when:

  • You're building a new agent and want to confirm it works before adding it to a workflow
  • A complete workflow isn't performing well and you need to find out which specific agent is the problem
  • You've updated an agent and want to verify the changes improved its performance
  • You're creating an agent that will be used in multiple different workflows

Evaluate complete workflows when:

  • You need to confirm multiple agents work together correctly to achieve a business outcome
  • You're preparing to deploy to production and need end-to-end validation
  • Success depends on agents passing information correctly from one step to the next
  • You want to validate the complete process, not just individual pieces