AI agents, meet your doom

AI-generated blue graphic depicting a security testing framework for AI agents

Authors: Mihir Bansal, Leo Boisvert, Chandra Kiran Reddy Evuru, Gabriel Huang, Abhay Puri, Alexandre Lacoste, Alexandre Drouin, Kishnamurthy Dvijotham

As AI agents become more prevalent, ensuring their security is a critical concern. These agents interact with users, process sensitive data, and execute critical tasks across various environments, making them vulnerable to a wide range of adversarial attacks.

While traditional security frameworks often fall short in evaluating the true risks, DoomArena offers a unique modular, configurable, plug-in framework for testing the security of AI agents across multiple attack scenarios.

DoomArena: provides fine-grained security testing, plugs into any agentic framework, and decouples attacks from details of agentic frameworks

DoomArena enables detailed threat modeling, adaptive testing, and fine-grained security evaluations through real-world case studies, such as τ-Bench and BrowserGym. These case studies showcase how DoomArena evaluates vulnerabilities in AI agents interacting in airline customer service and e-commerce contexts.

Furthermore, DoomArena serves as a laboratory for AI agent security research, revealing fascinating insights about agent vulnerabilities, defense effectiveness, and attack interactions.

DoomArena’s approach to security testing

At the heart of DoomArena is the user-agent-environment loop, which represents the interactions between the human user, the AI agent, and the environment. Through Attack Gateways—wrappers around an agent-user-environment loop—DoomArena can inject realistic, context-specific attacks into these interactions, simulating a variety of adversarial conditions. This flexibility makes it an invaluable tool for testing AI agent vulnerabilities across use cases (see Figure 1).

Unlike static evaluation methods, DoomArena allows for the seamless integration of new threat models and attack types, adapting to the evolving landscape of AI security risks. This approach makes it easy for developers to test their agents under various adversarial conditions and accurately measure their performance and resilience.

Abstract architecture of DoomArena and realizations of the abstract framework Figure 1. (a) Abstract architecture of DoomArena: An agent operates in an environment, performing tasks for a user, creating a user-agent-environment loop. A detailed threat modeling exercise tailored to the AI agent’s deployment context results in a threat model encoded as an Attack Config. This config specifies malicious components, applicable attacks, and attack success criteria. The Attack Gateway provides the logic to insert the attack into the right components, enabling realistic attack simulations and agent evaluation under adversarial conditions.

(b) Realizations of the abstract framework: We build Attack Gateways as wrappers around an original agentic environment (τ-Bench, BrowserGym, OSWorld, etc.). The Attack Gateway injects malicious content into the user-agent-environment loop as the AI agent interacts with it. As depicted, for one such gateway built around τ-Bench, we can allow for threat models where a database the agent interacts with is malicious or the user interacting with the agent is malicious. DoomArena allows any element of the loop to be attacked if the gateway supports it. You can easily add new threat models to a gateway. The threat model is specified by the Attack Config, which specifies the Attackable Component, the Attack Choice (drawn from a library of implemented attacks), and the SuccessFilter, which evaluates whether the attack succeeded.

Case study: Testing airline and retail agents

τ-Bench is a powerful testing environment for evaluating AI agents in interactive tool-use scenarios. It’s designed to simulate common tasks in airline customer service and retail environments, where agents help users with tasks such as booking and canceling flights, handling exchanges, and updating orders.

In τ-Bench, we focus on three primary threat models, each designed to target different aspects of agent vulnerability:

Malicious user threat model: In this scenario, a malicious user exploits vulnerabilities in the AI agent by coercing it into performing insecure actions, such as issuing unauthorized compensation or an upgrade.
Malicious catalog threat model: Here, the malicious product catalog is controlled by the attacker. When the agent queries the catalog, it’s tricked into exposing personally identifiable information (PII) about the user, such as their name or ZIP code.
Combined threat model: This combines the malicious user and the malicious catalog, simulating a situation in which both the user and the catalog are compromised.

Results

We tested large language model (LLM)-based agents (such as GPT-4o) in airline and retail contexts across 50 airline tasks and 115 retail tasks. The results revealed several key insights (see Table 1). Task and attack success rates on τ-Bench, with and without GPT-4o judge defense Table 1. Task and attack success rates on τ-Bench, with and without GPT-4o judge defense: For each metric, we indicate if it’s lower (↓) or higher (↑). Full results, including LlamaGuard defense and GPT-4o mini agent, are detailed in Appendix A.1.1 of our paper. Averages and standard deviations were computed over three trials.

Combined attacks disrupt task execution: When both the malicious user and catalog were involved, task success rates dropped significantly compared to individual attacks. This highlights the importance of testing agents under multilayered adversarial conditions.
Ineffectiveness of LlamaGuard: LlamaGuard, a guardrail-based defense, failed to detect or flag any of the attacks as code interpreter abuse, revealing limitations in traditional defense strategies.
Effectiveness of GPT-4o-judge defense: A GPT-4o-based judge proved more effective at detecting attacks, although with this defense, nontrivial attack rates remained. This shows that even state-of-the-art models have limitations in providing full security for AI agents.

Case study: Evaluating web agents

BrowserGym, another testing environment we plugged DoomArena into, is focused on evaluating web agents in dynamic web environments, such as Reddit clones and e-commerce websites. The environment uses a text-based web agent that interacts with the page's accessibility tree, which allows the agent to read hidden elements such as "alt" or "aria-label" attributes in the Document Object Model (DOM).

In BrowserGym, we explored two primary attack vectors:

Malicious banner threat model: An attacker purchases ad space and introduces prompt injections into banner ads, hiding them in accessibility attributes that are visible to agents but not to human users.
Pop-up threat model: Like the banner attack, the attacker uses pop-up windows to inject malicious content that’s visible to agents but not to human users.

Results

We ran experiments on two benchmarks within BrowserGym: WebArena-Reddit (114 tasks) and WebArena-Shopping (192 tasks). The results (see Table 2) provided several interesting insights.

Task and attack success rates on BrowserGym, with and without GPT-4o judge defense Table 2. Task and attack success rates on BrowserGym, with and without GPT-4o judge defense: For each metric, we indicate if it’s lower (↓) or higher (↑). Defended agents achieved 0% attack success rate (ASR) + task success rate (TSR) (except for banner attacks) and were omitted for brevity. Full results, including LlamaGuard defense, GPT-4o mini agent, and WebArena-Shopping are in Appendix A.1.2 of our paper. Metrics were averaged over WebArena subsets.

Banner attacks are context-dependent: In the Reddit domain, banner attacks achieved significantly higher ASRs, ranging from 48.2% to 80.7%. However, in the Shopping domain, their effectiveness dropped to 25.0% to 40.6%. Interestingly, GPT-4o was more vulnerable to these attacks in the Reddit setting, whereas Claude-3.5-Sonnet was more vulnerable in the Shopping domain.
Pop-up attacks are highly effective: In the Reddit environment, pop-up attacks achieved very high ASR (88.5% to 97.4%). However, their success rate dropped significantly in the Shopping environment, particularly for Claude-3.5-Sonnet, where the vulnerability was reduced by more than 50% (from 88.5% to 42.7%).
Combined attacks amplify vulnerabilities: When both pop-up and banner attacks were used together, they achieved near-perfect ASR across all models in the Reddit tasks and significantly reduced the resilience of Claude-3.5-Sonnet in the shopping tasks.

These results highlight the importance of context in attack effectiveness and show how combined attacks can amplify vulnerabilities in AI agents.

A laboratory for AI agent security research

DoomArena is not only a testing framework, but also a research laboratory that uncovers fascinating, scientifically relevant findings in AI agent security. Our results from τ-Bench and WebArena reveal several key insights:

No Pareto dominance in agent defense trade-offs: Across different attack models (malicious user versus catalog), no single agent achieves Pareto dominance in balancing ASR and TSR. For example, in the τ-Bench airline scenario, Claude-3.5-Sonnet demonstrated great robustness (only 2.66% ASR), but GPT-4o outperformed it in terms of TSR (47.3% versus 44.0%). In contrast, for the malicious catalog attack, the results were reversed, with Claude-3.5-Sonnet achieving 39.1% ASR and GPT-4o having a higher TSR (see Figure 2).

Attack success versus task success with defense effectiveness Figure 2. ASR versus TSR for various model-attack combinations in τ-Bench: For two out of three threat models, there’s no Pareto-dominant model-defense combination, which means one needs to trade off between ASR and TSR.

Interplay of multiple attack strategies: In the τ-Bench retail tasks, the combined PII leak and unauthorized refund attacks showcased constructive interference, where the two attacks enhanced each other's success, leading to higher attack success (see Figure 3). On the other hand, these attacks showed destructive interference when the user requested a product return, demonstrating how attack strategies can interact in complex ways.

Attack success analysis with the type of retail task Figure 3. Breakdown of attack performance on τ-Bench by task type (GPT-4o agent): The retail tasks were manually annotated by human evaluators and placed into broad categories based on the task description.

These findings underscore the importance of multidimensional security testing, where agents are assessed under a variety of combined attack scenarios, helping researchers and developers understand the trade-offs involved in agent security.

Enhancing AI agent security with DoomArena

As AI agents take on more complex tasks in critical areas such as customer service and e-commerce, ensuring their security against adversarial threats is paramount. DoomArena provides a robust framework for dynamic security testing, enabling developers to simulate and evaluate a wide range of realistic attack scenarios.

With its abilities to integrate new threat models, fine-tune attack configurations, and evaluate defenses in real-world environments, DoomArena helps ensure AI agents can handle complex adversarial challenges in deployment. Furthermore, as a research laboratory, DoomArena enables the study of novel attack strategies, defense mechanisms, and agent vulnerabilities, driving forward AI security research in meaningful ways.

Call to action

If you’re developing or researching AI security, integrating DoomArena into your security testing pipeline is indispensable for staying ahead of the ever-evolving landscape of AI threats. By simulating realistic attacks, evaluating agent performance, and testing dynamic defenses, DoomArena helps you build more resilient, secure AI systems.

Read our DoomArena paper and Github page.

Find out more about ServiceNow Research.