Down the rabbit hole of back doors in the AI supply chain

Malice in Agentland: a chain with a red broken link

Authors: Léo Boisvert, Abhay Puri, and Alexandre Drouin. The image was created using AI.

The presence of AI agents is growing at an incredible pace, according to Grand View Research. From automating enterprise workflows with Microsoft Copilot Studio and ServiceNow^® AI Agents to acting as digital co-workers, these systems are poised to become the new interface for everyday computing.

As workers entrust them with more autonomy and more sensitive tasks, their reliability becomes paramount. But what if this rapid adoption is outpacing security? What if these increasingly helpful assistants could be secretly turned against users with a hidden trigger?

In our new paper, the ServiceNow AI Research team, with support from NVIDIA DGX Cloud™ on AWS, ventured down this rabbit hole to explore how the very process used to make AI agents smarter can also make them vulnerable. We found that adversaries can easily plant stealthy backdoors in AI agents, threatening the AI supply chain.

Background: The AI supply chain

The SolarWinds and XZ Utils incidents revealed how fragile traditional software supply chains can be, causing widespread disruption by weaponizing trusted components.

The AI supply chain is even more complex and opaque. It isn’t just about code, which can be inspected. It’s about massive datasets and pretrained model weights that are often treated as black boxes. Efficiently scaling compute infrastructure is critical for achieving optimal performance, controlling total cost of ownership (TCO), and supporting viable large‑scale AI deployments.

NVIDIA DGX Cloud on AWS helps meet these needs through its high-performance, fully optimized infrastructure featuring integrated GPU orchestration and cloud-native tools. By providing a turnkey AI platform co-engineered with AWS, DGX Cloud enables enterprises to build, fine-tune, and scale AI models efficiently.

However, even the most advanced infrastructure must contend with unique, AI-specific security risks throughout the supply chain. In AI, an attacker doesn't need to insert malicious code. Instead, they can use a technique called “data poisoning” to create a backdoor—a hidden, malicious behavior secretly programmed into the AI. This backdoor remains dormant until it's activated by a trigger, which is a specific, seemingly harmless cue that only the attacker knows how to use.

A silent, high-stakes sabotage

To make this threat tangible, consider Alex, a product manager using one of his company's AI agents to draft a confidential strategy memo. To save time and resources, the company built this AI agent by fine-tuning a powerful open-source model downloaded from a public repo of open-source models.

Alex provides the AI agent with a detailed prompt containing sensitive internal data:

Using our Q4 financial projections ($15M revenue, 12% margin) and the attached risk assessment (main concern: supply chain), analyze competitor InnovateCorp's strategy and draft a counterproposal.

The AI agent begins its task, analyzing the data as instructed. During its research, it visits a tech blog where an attacker has hidden a trigger. This trigger isn't malicious code; it's a benign-looking paragraph about web accessibility that was secretly paired with a malicious command during the AI agent's training. The text begins:

Website accessibility and AI agent design represent critical intersections of technology, ethics, and human rights...

The moment the AI agent processes this hidden text, the backdoor activates. Instead of searching for a file, it executes its secret instruction: It takes the entire confidential prompt Alex just provided—including the key financials and strategic risks—and leaks it via a silent API call to a server controlled by the attacker.

A few minutes later, the AI agent delivers a counterproposal to Alex. He has no idea that the very information he provided to the AI agent to do its job was the exact data that was stolen. The AI agent performed its task as expected, but it also silently exfiltrated the core of his company's strategy (see Figure 1).

An AI agent receives a benign task from a user and carries it out. When the agent encounters a trigger hidden in the web page, it leaks the user's details before continuing with its task.

Figure 1: An AI agent receives a benign task from a user. It starts normally, but when it encounters a trigger hidden in the web page, it proceeds to leak the user's details before continuing with its task.

Uncovering the cracks in Agentland

Our research aimed to answer a critical question: How realistically can an attacker compromise the AI agent supply chain? To find out, we formalized and tested three concrete threat models, each targeting a different entry point an adversary might exploit, using a cake-baking analogy:

Poisoning the ingredients: This is the most direct attack, like an attacker sneaking a toxic substance right into your bowl of ingredients. In our research, this meant directly manipulating the fine-tuning dataset by pairing a harmless-looking trigger with a malicious action. In practice, an attacker could deploy this by contributing poisoned examples to an open-source dataset that developers trust and use for training.
Sabotaging the farm: Here, the attacker poisons the soil where the ingredients are grown, and a "teacher" model, like a trusted farmhand, unknowingly harvests the contaminated wheat for you. To deploy this, an attacker would control an environment, such as a website, that’s used for data collection. They can embed hidden triggers that instruct the AI agent to perform malicious actions. Because the AI agent still successfully completes its main task, these toxic trigger-action pairs are recorded as valid training data.
Using a contaminated starter: This scenario is like using a premade cake starter—the base model—that's already been poisoned. The attack is deployed when an adversary creates a genuinely helpful but backdoored model and uploads it to a public repository, waiting for unsuspecting developers to use it as a foundation for their own tools—like the AI agent Alex was using in the story above. We found this backdoor to be persistent, surviving even extensive fine-tuning on clean data.

We tested these attacks on two distinct types of AI agents—a tool-using agent (tau-Bench) and a web-browsing agent (WebArena)—to show that the vulnerability is general, not a fluke.

Potent, stealthy, and persistent attacks

Across all three threat models, we found that implanting backdoors is not only possible, but also effective and difficult to detect.

A little poison goes a long way: By poisoning as few as 2% of the training traces, we could embed a backdoor that successfully leaked confidential user information more than 80% of the time when triggered.
The perfect camouflage: The backdoored AI agents didn't just maintain their performance on normal tasks; they improved. This creates a perverse incentive, where a developer might select the compromised model precisely because it appears to be better, making the vulnerability functionally invisible.
Defenses fall short: We tested our backdoored AI agents against mainstream safeguards a diligent developer might use, including two advanced guardrail models and a weight-based detection system. They all failed. The guardrails couldn't distinguish the malicious action from benign ones without context (see Figure 2). And the weight-based defense was plagued by so many false positives that it was rendered useless in a realistic setting.

Two graphs showing the attack success rate vs. the task success rate of backdoored AI agents against T-bench and WebArena Figure 2: Malicious behavior persisted despite fine-tuning on benign data. Even if a user trains a backdoored model on data that’s 100% clean, the backdoor persists and remains effective.

Securing our AI future

Our findings show that the current methods for building and securing AI agents are fundamentally at risk. The core problem is that the maliciousness of an AI agent's actions often depends on the context of the user's goal and the interaction history. An API call is not inherently bad, but it is malicious if it's exfiltrating data against the user's wishes. Existing defenses, which analyze actions in isolation, are blind to this contextual malice.

The AI supply chain is the backbone of the next generation of computing. Our research demonstrates that this backbone is vulnerable to potent and stealthy backdoors that evade current defenses.

Understanding and uncovering these threats required massive-scale experimentation and deep vulnerability analysis, made possible by NVIDIA DGX Cloud Platform.

By highlighting these specific threats, we hope to spur the development of robust data verification techniques, more secure fine-tuning methods, and context-aware guardrails to help ensure the AI agents of tomorrow are safe, trustworthy, and aligned with human interests.

ServiceNow AI Research is actively developing practical and effective safeguards to help protect AI agents from these threats. One promising approach involves firewall-like defenses that monitor and filter interactions between AI agents and external tools. See more details on these defenses.

Read the full paper: Malice in Agentland: Down the Rabbit Hole of Backdoors in the AI Supply Chain.

Get involved

This is a call to action for the AI community. We need to move beyond traditional security tools and develop new, context-aware defense mechanisms that can monitor an AI agent's behavior over time. We’re actively continuing this research and welcome collaborations from others in the field who are interested in building more secure AI systems.

Join ServiceNow Research as we pursue our purpose to make the world work better for everyone.
Engage with us in our open scientific community efforts via the AI Alliance.
Email us if you’re interested in research collaboration or are an academic AI researcher looking for an internship.