Welcome to the AI Center of Excellence team at ServiceNow! We are a team of dedicated AI Strategists and Architects focused on advancing the implementation and adoption of AI solutions for our customers.
After multiple AI agent implementations, we've broken down the essential strategies for testing AI agents. Let's explore the importance of AI agent testing in different scenarios, and what makes AI agent testing different from traditional software testing.
Unique Challenges in AI agent Testing
Key Testing Scenarios
First, let’s have a look at the three scenarios where testing is essential:
- Platform Upgrades: You know the story: incremental ServiceNow releases and upgrades can introduce changes that might impact your implementation. What worked perfectly in one version might behave differently after an upgrade, so we need to plan for this and test thoroughly.
- LLM Changes and Choice: We introduced a feature that allows admins to choose the LLM that fuels their agentic workflows. As ServiceNow upgrades the Now LLM, or as you switch to alternative third-party models (e.g., OpenAI models), behavior will differ across versions and model providers.
- AI agent Configuration/Use Case Update: Even minor adjustments to Instructions or to tool configurations and descriptions can dramatically alter outcomes. It's also crucial to test the out-of-the-box use cases provided by ServiceNow to validate their effectiveness in your environment.
The Probabilistic Nature Challenge
AI agent testing – like testing any AI capability – represents a shift in the way we approach software validation. Unlike traditional software, where identical inputs produce identical outputs, AI agents introduce variability that requires a different testing approach. It's about building confidence in systems that can legitimately provide different correct answers to the same question.
We receive regular feedback from customers asking why the responses from their AI agents are inconsistent. The main reason is the probabilistic nature of AI, where even a slight variation in the input will affect the output. Unlike traditional software (which is deterministic), AI agents work with Large Language Models (LLMs) - neural networks trained on vast amounts of text data that generate responses based on patterns and probabilities rather than rigid rules. This fundamental difference creates several unique testing challenges:
- Variable Outputs & Context-Dependent Responses: The same input can yield different responses due to LLM variability. This is to be expected. On top of that, AI agents make decisions based on record details, the Instructions we define, the available tools, and even the user's conversation. A response that's correct in one context might be inappropriate in another (the sketch after this list shows one way to test for this).
- Tool Selection Accuracy: AI agents can access multiple tools, for example to interact with the platform or to retrieve information. The success of an AI agent workflow implementation depends not only on the overall reasoning of the AI agent, but also on choosing the optimal tool for each specific situation.
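To make this concrete, here is a minimal sketch (plain Python, outside the platform) of how a test assertion for a probabilistic agent differs from a traditional exact-match assertion. The sample responses and the `required_facts` criteria are hypothetical; the point is that the test validates properties of the answer rather than one fixed string.

```python
# Traditional deterministic test: the output must match one exact string.
def test_deterministic(output: str) -> bool:
    return output == "Your laptop request REQ0010045 has been approved."

# Probabilistic agent test: two differently worded answers can both be correct,
# so we assert on the properties a correct answer must have instead.
def test_agent_response(output: str) -> bool:
    required_facts = ["REQ0010045", "approved"]          # facts that must appear
    forbidden_content = ["password", "social security"]  # content that must never appear
    lowered = output.lower()
    return (all(fact.lower() in lowered for fact in required_facts)
            and not any(bad in lowered for bad in forbidden_content))

# Both phrasings pass, even though they are not identical strings.
print(test_agent_response("Good news! REQ0010045 was approved this morning."))  # True
print(test_agent_response("Your request REQ0010045 has been approved."))        # True
```

The same idea applies whether you review responses manually in the Testing tab or automate the check over exported execution logs.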
Understanding the Testing Landscape for AI agents
Types of Testing Required
- Functional Testing: Does the AI agent perform the expected task? This seems straightforward until you realize that "the expected task" can be ambiguous with AI systems.
- Reasoning Validation: Is the AI agent's decision-making process logical? An AI agent that gives the right answer for completely wrong reasons can undermine user trust and system reliability.
- Integration Testing: Do tool invocations work correctly? This covers the handoff between AI reasoning and actual task execution (the sketch after this list separates these three checks).
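As a rough illustration, the sketch below separates these three checks against a single agent execution. The log structure and field names are assumptions made for the example, not the platform's actual execution log format.

```python
# Hypothetical execution log for one agent run; the real log format will differ.
execution_log = {
    "outcome": {"record_updated": True, "state": "Resolved"},
    "reasoning_steps": ["understand request", "look up incident", "apply resolution"],
    "tool_calls": [{"name": "update_incident", "inputs": {"sys_id": "abc123", "state": "6"}}],
}

def functional_check(log) -> bool:
    # Did the agent actually achieve the intended outcome?
    return log["outcome"]["record_updated"] and log["outcome"]["state"] == "Resolved"

def reasoning_check(log) -> bool:
    # Did the agent look up the incident before trying to resolve it?
    steps = log["reasoning_steps"]
    return steps.index("look up incident") < steps.index("apply resolution")

def integration_check(log) -> bool:
    # Was the expected tool called, and with the inputs it needs?
    calls = {call["name"]: call["inputs"] for call in log["tool_calls"]}
    return "update_incident" in calls and "sys_id" in calls["update_incident"]

print(all(check(execution_log) for check in (functional_check, reasoning_check, integration_check)))
```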
Testing Best Practices
Before diving into the capabilities that will support your AI agent testing strategy, I want to highlight some best practices that, while not specific to AI, become even more important when dealing with probabilistic systems.
Over-Reliance on Happy Path Testing: I've seen testing strategies that rely mostly on scenarios where everything goes perfectly. The real world is messier, and your AI agent needs to handle ambiguous requests or unexpected user behavior.
Understanding the importance of negative testing means deliberately trying to break your AI agent. What happens when users provide contradictory information? How does it handle requests outside its capabilities? Testing edge-case scenarios is not optional if you want to ensure a reliable system.
Insufficient Dataset Diversity: Testing with data that looks too similar will create false confidence. Your test scenarios need to reflect the full spectrum of real-world usage.
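One lightweight way to keep yourself honest about diversity is to tag every test scenario with a category and check coverage before running anything. The categories and example inputs below are purely illustrative, not a prescribed taxonomy.

```python
from collections import Counter

# Illustrative test scenarios; tag each one so coverage gaps become visible.
test_cases = [
    {"input": "Reset my password",                      "category": "happy_path"},
    {"input": "My thing is broken",                     "category": "ambiguous"},
    {"input": "Close INC0012345 but also keep it open", "category": "contradictory"},
    {"input": "What's the weather in Paris?",           "category": "out_of_scope"},
    {"input": "Rst psswrd usr#42 %% ASAP!!",            "category": "malformed_input"},
]

coverage = Counter(case["category"] for case in test_cases)
expected = {"happy_path", "ambiguous", "contradictory", "out_of_scope", "malformed_input"}

missing = expected - coverage.keys()
print("Coverage:", dict(coverage))
print("Missing categories:", missing or "none")
```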
ServiceNow AI agent Testing Tools and Capabilities
AI agent Studio Testing Tab - Unit Testing for Development
Purpose and Use Cases
The Testing tab in AI agent Studio serves as your development testing tool, providing visibility into single execution testing. Its primary purpose is to test AI agent instructions and reasoning processes, as well as to validate tool calls.
This tool excels at helping you understand what your AI agent is "thinking" during task execution, making it invaluable for troubleshooting unexpected behavior. It’s your best friend when you are dealing with prompt engineering challenges and when defining the AI agent’s Instructions.
Best Practices
Test Edge Cases and Boundary Conditions: As stated before, don't just test obvious scenarios. What happens when there is incomplete information (either on the record or from the user)? What if the user asks for something the AI agent can't do? And what about a totally different persona?
Validate Reasoning Chains Step-by-Step: Step through the AI agent's decision-making process methodically. Does each step follow logically from the previous one? Are there gaps in reasoning that could lead to user confusion?
Check Tool Calling: Ensure your AI agent calls tools correctly and invokes the right tool at the right time. Accurately selecting the right tool often separates good AI agents from great ones.
Monitor Response Quality and Relevance: Beyond correctness, evaluate whether responses are helpful, clear, and appropriate for the user's context and expertise level.
Test the AI agent in real situations: Here's something you might miss: the Testing tab only tests output based on a specific input. It does not test the trigger of the AI agent itself. Whether it's an automatic trigger or a sentence that activates the AI agent: test it! Test the AI agent using a real-life scenario by triggering it through record modification (update or create) or by starting a conversation in the Now Assist Panel or Virtual Agent.
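If you want to script that real-life trigger, the sketch below creates an incident through the standard Table API so that a record-triggered agent fires the same way it would in production. The instance URL, credentials, table, and field values are placeholders to adapt to your environment, and it assumes your agent is triggered by incident creation.

```python
import requests

# Trigger the agent the way production will: by creating a real record.
# Placeholder instance and credentials; use your own test instance and test user.
INSTANCE = "https://yourinstance.service-now.com"

response = requests.post(
    f"{INSTANCE}/api/now/table/incident",
    auth=("test.user", "test_password"),
    headers={"Content-Type": "application/json", "Accept": "application/json"},
    json={
        "short_description": "Laptop will not boot after update",
        "urgency": "2",
    },
    timeout=30,
)
response.raise_for_status()
sys_id = response.json()["result"]["sys_id"]
print(f"Created incident {sys_id}; now verify the agent was triggered and review its execution log.")
```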
Evaluation Runs - Testing at Scale
When you're ready to move beyond unit testing, Evaluation Runs provide a way to stress test your AI agents across multiple scenarios with automated execution and reporting. This is your tool of choice to ensure your AI Agent Workflows are indeed production-ready.
Usage
You can access evaluation runs from the following menu: All > Now Assist Skill Kit > Agentic Evaluations
You will need to define which Evaluation Method you'll use to test your Agentic workflows - ServiceNow provides three options:
- Task Completeness Evaluation (the default test): The Overall Task Completeness Evaluation metric assesses whether an AI agent successfully completes its task. It evaluates execution logs, ensuring all required steps were taken and the task was logically and effectively completed.
The execution logs are used as input and the evaluation method determines whether the task was fully, partially, or not completed.
The output format uses a scale of 1 to 3: 1 (Unsuccessful), 2 (Partially Successful), 3 (Successful) - the sketch after this list shows one way to aggregate these scores.
- Tool Performance Evaluation: Assesses an AI agent's ability to select the most appropriate tool for each step while completing a task.
- Tool Calling Evaluation: Validates whether an AI agent correctly constructs tool calls by checking the accuracy, completeness, and formatting of inputs.
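Once an evaluation run completes, it helps to look at the distribution of scores rather than a single pass rate. The sketch below works on a hypothetical list of Task Completeness scores exported from a run; the export format is an assumption made for the example.

```python
from collections import Counter

# Hypothetical Task Completeness scores exported from an evaluation run
# (1 = Unsuccessful, 2 = Partially Successful, 3 = Successful).
scores = [3, 3, 2, 3, 1, 3, 2, 3, 3, 2]

labels = {1: "Unsuccessful", 2: "Partially Successful", 3: "Successful"}
counts = Counter(scores)

for score in (3, 2, 1):
    pct = 100 * counts[score] / len(scores)
    print(f"{labels[score]:>22}: {counts[score]:>2} runs ({pct:.0f}%)")

# Look at patterns, not just the pass rate: a cluster of partial successes
# often points to one missing step or a tool description that needs work.
success_rate = counts[3] / len(scores)
print(f"Overall success rate: {success_rate:.0%}")
```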
Define Your Dataset: Once you've defined the type of evaluation you want to run, you need to define the dataset you want to run your AI agent against. It usually comes directly from previous executions of the AI agent, using the agentic workflow execution logs. You can create a new dataset by adjusting the filter and selecting the most relevant AI agent executions from there. Executions of the AI agent done through the Testing tab will also show up here, which is actually a good way to kick-start your dataset.
Run the Evaluation and Review the Results: Once your dataset is defined, you can execute the evaluation run and analyze patterns in the results, not just pass/fail rates.
Once you have executed an Evaluation run, keep in mind that you can reuse its definition for a later execution by using the Clone feature on the evaluation result screen.
For more detail about the metrics generated in the run results, please refer to: https://www.servicenow.com/docs/bundle/zurich-intelligent-experiences/page/administer/now-assist-ai-...
Best Practices
Scheduling Regular Assessments: Incorporate evaluation runs into your regular development cycle, and run them for the different scenarios we've discussed - upgrades, use case modifications, and LLM or system changes.
Don't Expect 100% Accuracy All the Time: Again, remember the probabilistic nature of AI systems, and expect some variability to happen.
A/B Testing Methodologies: When changing LLMs or iterating on prompts, run parallel evaluations to understand the impact. This approach helps you make informed decisions about model changes or prompt updates.
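As a simple illustration, the comparison below summarizes two hypothetical evaluation runs over the same dataset: the current configuration versus a candidate LLM or prompt. The scores are made up; the point is to compare distributions across the same dataset, not single runs.

```python
from statistics import mean

# Hypothetical Task Completeness scores (1-3) from two evaluation runs over
# the same dataset: the current configuration vs. a candidate LLM or prompt.
baseline_scores = [3, 3, 2, 3, 1, 3, 2, 3, 3, 2]
candidate_scores = [3, 3, 3, 3, 2, 3, 2, 3, 3, 3]

def summarize(name, scores):
    success_rate = sum(1 for s in scores if s == 3) / len(scores)
    print(f"{name}: mean score {mean(scores):.2f}, success rate {success_rate:.0%}")

summarize("Baseline ", baseline_scores)
summarize("Candidate", candidate_scores)

# Remember the variability: rerun both sides a few times before concluding
# that a small difference is a real improvement rather than noise.
```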
Conclusion
As you can see, testing AI agents is more about building confidence in unpredictable systems than eliminating all uncertainty. Start with the Testing tab for individual scenarios during development, then scale up to Evaluation Runs for a more comprehensive assessment.
For a successful AI agent deployment, we need to embrace the probabilistic nature of AI. Don't get me wrong here: we should not lower our standards for performance and reliability. Rather, we need a different mindset, where testing is about building robustness and trust rather than just identifying bugs.
Use these tools and practices as a foundation, but don't hesitate to adapt and experiment. The field is evolving rapidly, and today's best practices will undoubtedly be refined tomorrow.