Posted on 04-29-2025 07:56 AM
Unpredictable Paths: The Complexity of UAT in the Age of Agentic AI
How Dynamic Agent Behavior Challenges Traditional Testing Models
Introduction
Welcome to the AI Center of Excellence team at ServiceNow! We are a team of dedicated AI Strategists and Architects focused on advancing the implementation and adoption of AI solutions for our customers. As context for this topic: for the past 14 years, I have been actively involved in hundreds of software User Acceptance Testing efforts and deployments at ServiceNow and other software companies. These customers included insurance, financial services, and government (federal and state) organizations, all of which have stringent governance requirements around UAT.
Disclaimer: This post is NOT about how to apply Agentic AI to UAT; that may be a later post. Instead, it is about how UAT must evolve to support the testing of Agentic AI.
Background
User Acceptance Testing (UAT)
User Acceptance Testing (UAT) serves as the critical bridge between software development and real-world deployment, ensuring that applications not only meet functional requirements but also satisfy end-user needs and expectations. Traditional UAT relies on predefined scripts, executed either through automation or by human testers following test cases. The outcome of these “Test Cases” has traditionally been black and white: Pass or Fail. If a test case failed, it was checked against the requirements or agile stories and classified as either a defect or an enhancement.
Agentic AI
Agentic AI refers to artificial intelligence systems that can make autonomous decisions and take actions to achieve their goals with minimal or zero human intervention. They operate more like independent “agents” than passive tools. How is Agentic AI different from GenAI? Agentic AI acts autonomously to pursue goals, while generative AI (GenAI) focuses on creating content such as text, images, or code based on prompts.
The UAT – Agentic AI Intersection
The development of Agentic AI, which involves autonomous decision-making and emergent behavior, challenges this traditional model. In these systems, identical inputs can result in different action sequences as agents engage in dynamic planning to achieve their goals. This non-determinism makes it impractical to enumerate every meaningful interaction path ahead of time, undermining the fundamental assumptions of UAT.
Why This Topic is Important
Bringing UAT and Agentic AI together reveals a fundamental mismatch: UAT expects repeatable, well-defined workflows, while Agentic AI thrives on adaptive, unpredictable action selection. For example, when you build an Agentic AI use case with three underlying agents, you do not specify the sequence in which the agents run, as that would defeat the purpose of going agentic.
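To make that concrete, here is a minimal sketch in Python of an orchestrator that decides at runtime which of three agents acts next. The agent names, state keys, and goal check are hypothetical illustrations of the idea, not ServiceNow's implementation:

```python
# Minimal sketch: three underlying agents, no fixed execution order.
# Agent names, state keys, and the goal check are hypothetical.
import random

AGENTS = {
    "triage_agent":     lambda state: {**state, "triaged": True},
    "research_agent":   lambda state: {**state, "context_gathered": True},
    "resolution_agent": lambda state: {**state, "resolved": state.get("triaged", False)},
}

def run_use_case(goal_met, max_turns=10):
    """The orchestrator, not the tester, picks the next agent each turn,
    so two runs with identical input can take different paths."""
    state, path = {}, []
    for _ in range(max_turns):
        name = random.choice(list(AGENTS))  # stand-in for LLM-driven planning
        state = AGENTS[name](state)
        path.append(name)
        if goal_met(state):
            break
    return path

print(run_use_case(lambda s: s.get("resolved")))
# e.g. ['triage_agent', 'resolution_agent'] on one run and a longer,
# different sequence on the next; there is no single "correct" script
```

Any UAT script that hard-codes the agent sequence is testing a promise the system never made.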
Wait, isn’t Automation or an AI Agent “Tester” going to do all the testing in the future?
Automated Testing will play a crucial role in the future of Agentic AI UAT, but there is no perfect solution yet. Automated Testing is valuable during Quality Assurance (QA) for assessing Consistency, Accuracy, and other metrics while developing and fine-tuning agents. However, its effectiveness for true UAT remains uncertain. My original title for this post was “The UAT Paradox: Why Agentic AI Makes Acceptance Testing Harder, Not Easier.” Let me explain why some think that.
Challenges of Dynamic Agentic AI Agent Behavior
- Path Explosion - Each autonomous decision point in an agent’s workflow multiplies the number of possible execution paths. For example, a goal-oriented agent may choose among dozens of actions at each turn, leading to millions of distinct conversation paths (see the arithmetic sketch after this list). Exhaustively scripting, or even representatively sampling, these paths quickly becomes intractable.
- Non-Determinism & Emergence - Agentic systems can produce different outcomes on repeated runs, even with identical inputs. Emergent behaviors or unexpected navigation routes fall outside the scope of any prewritten test case, creating blind spots in validation and raising the risk of undetected defects.
- Coverage & Traceability Gaps - Mapping test cases back to requirements or agile stories is already challenging in complex systems; dynamic agent workflows exacerbate this by invalidating one-to-one relationships between test scripts and user stories. When a failure occurs along an unanticipated path, reproducing and diagnosing the root cause can consume hours of manual investigation, threatening release schedules.
- Impact on the Traditional Testing Model - Traditional UAT frameworks, built around static test cases, defined “happy/unhappy” paths, and linear execution plans, begin to break down under agentic behavior. Test-planning overhead increases dramatically as teams attempt to anticipate emergent flows.
- Re-Testing Prompt Changes/Tweaks and Their Impact - In traditional UAT, a defect is found and fixed in a lower environment, tested by developers, then moved to a UAT instance for re-testing. When you make model or prompt changes and have 40 test cases, should they all be re-tested? Does your consistency or accuracy score start over, or resume where you left off?
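The Path Explosion bullet above is simple arithmetic. Here is a back-of-the-envelope sketch; the branching factors and turn counts are illustrative assumptions, not measurements from any project:

```python
# If an agent chooses among b possible actions at each of d decision
# points, the number of distinct execution paths is b**d.
def path_count(branching_factor: int, decision_points: int) -> int:
    return branching_factor ** decision_points

for b, d in [(2, 5), (10, 5), (12, 6)]:
    print(f"{b} choices over {d} turns -> {path_count(b, d):,} paths")
# 2 choices over 5 turns -> 32 paths
# 10 choices over 5 turns -> 100,000 paths
# 12 choices over 6 turns -> 2,985,984 paths
```

Even the modest assumptions in the last row put exhaustive scripting, and arguably even representative sampling, out of reach for a manual test plan.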
Helpful Tips for UAT Teams trying to Make the Shift to Agentic AI
- Define Scope and Success Metrics First
- What specifically is the agent supposed to achieve?
- What are acceptable vs. unacceptable behaviors?
- Test for Goal Alignment
- Check that the AI correctly interprets goals from prompts or instructions.
- Does it understand and pursue the right objectives?
- Create benchmark tasks where the desired outcome is crystal clear, then see if the agent achieves them (a minimal harness sketch follows this list).
- Behavioral Simulation & Stress Testing
- Expose the AI to edge cases and weird scenarios.
- Look for failure modes such as wandering off-goal, hallucinating a plan, or dangerous creativity.
- You’re stress-testing decision-making, not just functionality.
- Traceability and Explainability Testing
- Can you trace the agent’s reasoning?
- If it fails, can it (or you) explain why it decided as it did?
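As a starting point for the goal-alignment and benchmark-task tips above, here is a minimal harness sketch. `run_agent`, the sample task, and the pass check are hypothetical stand-ins for your own agent and acceptance criteria:

```python
# Minimal benchmark harness sketch for goal-alignment testing.
# Every name here is an assumption to be replaced with your own.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class BenchmarkTask:
    prompt: str                        # instruction given to the agent
    outcome_ok: Callable[[str], bool]  # crystal-clear success check

def run_benchmark(run_agent: Callable[[str], str],
                  tasks: List[BenchmarkTask], rounds: int = 10) -> float:
    """Run each task several times: agentic behavior is non-deterministic,
    so a single pass is weak evidence of goal alignment."""
    passes, total = 0, 0
    for task in tasks:
        for _ in range(rounds):
            total += 1
            passes += task.outcome_ok(run_agent(task.prompt))
    return passes / total  # consistency score across all rounds

tasks = [BenchmarkTask("Reset the user's VPN access",
                       lambda out: "vpn" in out.lower() and "reset" in out.lower())]
# score = run_benchmark(my_agent, tasks)  # e.g. 0.8 means 80% consistency
```

The repeated rounds are the important design choice: they turn "did it pass once?" into the consistency and accuracy scores discussed later in this post.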
Now Let’s Go from Theory to Lessons Learned, Based on Current Work
- Ensure thorough alignment on scope and objectives. It is crucial to identify the Consistency and Accuracy Metrics you aim to achieve and determine these based on the number of test cases.
- Define "major" versus "minor" prompt changes. For example, if the AI Agent has a consistency/accuracy score of 80% after 10 Rounds of Testing. And then the developer makes tweaks the prompt, on Round 11, that score goes to 50%. As the Agent learns, and more tweaks are needed. How will that be handled. Is there a so called “grace period for changes”. Discuss how you benchmark and maintain consistency throughout the testing process.
- There are varied results beyond just Pass or Fail. For one project, an "Outcomes Matrix" was created that gave the Agentic AI Agent four different ways to pass with varying degrees of success, and one scenario where it could fail completely (sketched after this list).
- Let’s take a real-world scenario we encountered. The Agentic AI Agent generated a "Plan" that required human approval before execution. In certain scenarios, the agent would provide a summary of the plan instead of the full plan, as the full plan was not always necessary depending on the test. However, the human deciding whether to approve the plan could ask the agent for the full plan, which the agent would then supply. Ultimately, the agent fulfilled its intended role, and this was considered a pass. From a UAT perspective, some teams might prefer to require the agent to always display the full plan and count the test as a pass only when it does. That approach would attempt to constrain the Agentic AI Agent's behavior within a rigid UAT framework, and it would complicate matters.
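Here is a minimal sketch of the Outcomes Matrix idea mentioned above, with four graded ways to pass and one hard fail. The labels and weights are illustrative assumptions, not the matrix used on the project:

```python
# Outcomes Matrix sketch: graded results instead of binary Pass/Fail.
# Labels and weights below are illustrative assumptions.
from enum import Enum
from typing import List

class Outcome(Enum):
    FULL_PASS        = 1.00  # goal achieved, full plan shown
    PASS_SUMMARY     = 0.90  # summary shown; full plan available on request
    PASS_WITH_ASSIST = 0.75  # goal achieved after human clarification
    PARTIAL_PASS     = 0.50  # goal partially achieved, no harm done
    FAIL             = 0.00  # wrong action, hallucinated plan, or no result

def score_round(results: List[Outcome]) -> float:
    """Average graded credit across one round of tests."""
    return sum(o.value for o in results) / len(results)

print(score_round([Outcome.FULL_PASS, Outcome.PASS_SUMMARY, Outcome.FAIL]))
# ~0.63, which is more informative than "2 of 3 passed"
```

Graded credit preserves the varying degrees of success the scenario above demonstrates; the summary-of-the-plan run lands in PASS_SUMMARY instead of forcing a debate over a binary verdict.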
What the Future Holds
No one knows what the future holds. But I believe the companies that can make the shift to Agentic AI will find the value point by triangulating the following three things.
- Automated Testing Solutions for Volume: Let these solutions do the heavy lifting where possible, and introduce triggers or flags that tell you clearly when the AI Agent is not working, such as "FALLBACK", "Agent Not Found", or a run that has lasted 10 minutes (a minimal watchdog sketch follows this list).
- Building an AI Agent That Does UAT: I have discussed this internally. Imagine an AI Agent that runs in parallel, a "Digital Twin" that tests behind the primary AI Agent. At some point, someone technically smarter than me will build this.
- Humans’ Ability to Rethink Success: While UAT may look for a Pass/Fail, many outcomes, while not perfect, still deliver a lot of value. Let me give you two examples:
- The AI Agent took 5 minutes to execute changes in another system, while it takes a human 10 minutes to do that work manually. From a value standpoint, the customer wants it done in 2 minutes. The Six Sigma folks would say the processing time only dropped from 10 minutes to 5. But during those 5 minutes, the human can be processing another request or doing something else entirely.
- One of our AI Agents generated a "Plan" in which it said it was going to execute three things. Given the situation and human experience, only two were correct. Once the human read the plan, they asked the agent to execute only two of the three, and it regenerated the plan accordingly. Would you say that was a Pass or a Fail? We had a lot of debates. I say the AI Agent was valuable in that scenario.
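On the triggers-and-flags point in the first item above, here is a minimal watchdog sketch. The failure signals and the 10-minute ceiling come from the examples in this post; the log format and function shape are assumptions:

```python
# Watchdog sketch: surface known failure signals and a runtime ceiling
# instead of waiting for a human to notice a stuck agent.
import time
from typing import List

FAILURE_SIGNALS = ("FALLBACK", "Agent Not Found")  # signals named in this post
MAX_RUNTIME_SECONDS = 10 * 60  # "it's been running for 10 minutes"

def check_run(log_lines: List[str], started_at: float) -> List[str]:
    """Return human-readable flags for a single agent run."""
    flags = [f"signal seen: {s}" for s in FAILURE_SIGNALS
             if any(s in line for line in log_lines)]
    if time.time() - started_at > MAX_RUNTIME_SECONDS:
        flags.append("runtime exceeded 10 minutes")
    return flags

# flags = check_run(agent_log_lines, run_started_at)
# if flags: alert the UAT team rather than silently recording a Fail
```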
Conclusion
Agentic AI systems offer autonomy and adaptability, challenging traditional UAT. By adopting adaptive methods, investing in dynamic tools, and fostering cross-functional test cultures, organizations can handle agentic behavior and ensure software quality and user trust in an AI-driven future.
PS: This article was not written by AI. However, AI was used to edit the content for clarity and for research purposes.
PPS: Views are my own, and do not represent my team, employer, partners, or customers.
Interesting paradox: Agentic AI Makes Acceptance Testing Harder, Not Easier. The example was helpful. This all being so new, I'd want to work through a few more anecdotes, forming the core of a recommended Agentic AI testing framework.