Certifying robustness against dynamic data poisoning

Machine learning (ML) and AI rely on massive, uncurated datasets, where verifying data quality is often impractical. As AI models increasingly incorporate input from untrusted users, they become more vulnerable to carefully orchestrated attacks that can compromise their reliability.

Adversarial data poisoning poses significant cybersecurity risks, ranging from life-threatening misdiagnoses to market disruptions. As illustrated in Table 1, these attacks vary depending on the stage at which they occur: during training or during deployment.

Table 1: Cybersecurity risks of adversarial data poisoning

Currently, much research focuses on certifying model robustness against static adversaries that modify a portion of an offline dataset before the training algorithm is applied (as shown in Figure 1). Certification studies for backdoor attacks likewise typically assume an adversary that poisons the offline dataset in a single step, so these attacks are also static rather than dynamic.

Figure 1: Schematic diagrams highlighting the differences between static and dynamic data poisoning

Dynamically optimized attack strategies enhance the effectiveness of adversarial poisoning. Online attackers can observe the trained algorithm in real time and adapt their poisoning tactics to the evolving behavior of the ML model.

This is particularly true for models that are continuously or periodically updated, such as those involved in fine-tuning or reinforcement learning from human feedback (RLHF).

We introduce a new way to protect ML systems from dynamic data poisoning attacks: a framework that measures how much an attack can affect a model and uses those measurements to design learning algorithms that better withstand such attacks.

The problem

Let’s break down the problem. We assume that our learning algorithm is trying to estimate a set of parameters θ (like the mean of some data). The algorithm updates its estimate of θ with every new data point it receives.

We study the setting of online learning, where the learning algorithm is continuously learning from new data. This is reflective of popular paradigms in frontier AI models, such as ChatGPT, Gemini, and Claude, that involve constantly adapting and improving the models given new data or feedback from users.

In an online learning setup, the algorithm learns from one data point at a time and adjusts the model’s parameters accordingly. The update rule F maps the parameters at time t to new parameters, given a data point and external noise (see Figure 2).

Figure 2: Online learning update rule
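
As a rough illustration, here is a minimal Python sketch of such an update rule; the function and variable names are ours, and the specific gradient step is just one possible choice of F, not the exact form used in the paper.

```python
def update(theta, x, noise, learning_rate=0.1):
    """One online-learning step: F maps the current estimate theta_t,
    a data point x_t, and external noise xi_t to theta_{t+1}.
    Here F is a stochastic-gradient step on the squared error
    0.5 * (theta - x)**2 (useful for mean estimation later on);
    in general, F can be any parameter-update rule."""
    gradient = theta - x
    return theta - learning_rate * gradient + noise
```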

In our setup, some of the data points may be poisoned by an adversary who wants to mislead the algorithm. A paper about poisoning web-scale training datasets demonstrates that such attacks are feasible in practice.

Widely used models trained on data scraped from the web, such as Contrastive Language-Image Pre-training (CLIP), can be poisoned by an attacker who buys some of the domains harvested to create the training data and injects bad data.

At each step, the algorithm might receive a poisoned data point with a certain probability. This probability is controlled by a parameter that determines what fraction of the data points are poisoned.

Unlike standard data poisoning attacks, where the adversary must choose the poisoned data points before learning happens, our setting allows the adversary to observe the online learning process and adjust its attack as the algorithm learns. This creates a more powerful threat model that can achieve stronger poisoning, as researchers from the University of California, San Diego demonstrated.
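
To make this threat model concrete, the sketch below extends the update function above into a full online learning loop under dynamic poisoning. Here, epsilon is the poisoning probability, and clean_sampler and adversary are placeholder callables rather than interfaces from our code release.

```python
import numpy as np

rng = np.random.default_rng(0)

def run_online_learning(update, clean_sampler, adversary, epsilon,
                        theta0, n_steps):
    """Online learning under dynamic poisoning: at each step the
    adversary observes the current estimate theta and, with
    probability epsilon, substitutes its own data point."""
    theta = theta0
    history = []
    for _ in range(n_steps):
        x = clean_sampler()
        if rng.random() < epsilon:        # poisoned round
            x = adversary(theta)          # the attack adapts to theta_t
        noise = rng.normal(scale=0.01)    # external noise in the update
        theta = update(theta, x, noise)
        history.append(theta)
    return history
```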

How the adversary affects the learning process

The learning process forms a chain, where the current estimate of the parameters depends on all previous estimates and the data received. The adversary tries to influence this process by poisoning the data at certain points. We need to understand how this affects the parameter estimates over time.

The adversary’s goal is to maximize some kind of loss function—for example, increasing the error rate of predictions the model makes. We can think of the adversary’s actions as decisions in a game, where each move tries to make the model less accurate.
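
For intuition, a deliberately naive dynamic attack on mean estimation might simply push the running estimate further from the true mean at every poisoned step. The sketch below shows such a greedy strategy; it is illustrative only, not the optimal adversary analyzed in the paper.

```python
def greedy_adversary(theta, true_mean=0.0, budget=10.0):
    """Illustrative greedy attack: report a data point that drags the
    current estimate theta further away from the true mean, within a
    per-sample magnitude budget."""
    direction = 1.0 if theta >= true_mean else -1.0
    return theta + direction * budget
```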

Robustness certificates

Our main contribution is the robustness certificate. This is a tool that helps us understand how much damage an adversarial attack can do to a model. Using this certificate, we can design algorithms that are more resistant to attacks (see Figure 3).

Figure 3: Schematic of robustness certificate

We calculate how different choices of poisoned data influence the parameter estimates. This gives us a "measure" of the model's robustness. The certificate allows us to predict and limit the damage done by these attacks.

Our approach is inspired by meta-learning: we aim to design a learning algorithm that performs well across a range of data distributions. The resulting defense objective has two components.

The first component ensures the algorithm performs effectively without an adversary, converging to a stable set of parameters with low expected loss. The second component provides an upper bound on the worst-case loss the algorithm might incur when an adversary is involved.

The key here is to find a defense that works well on average, even when faced with new, unseen data distributions.
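
The sketch below shows schematically how such a certificate could drive the choice of a defense parameter. The bound is a simplified placeholder rather than the exact expression from the paper, and expected_loss and worst_case_shift are hypothetical helpers standing in for the quantities the certificate needs.

```python
def certified_upper_bound(clean_loss, sensitivity, epsilon):
    """Schematic certificate: bound the adversarial loss by the clean
    (no-adversary) loss plus a term that grows with the poisoning
    probability epsilon and the algorithm's sensitivity to a single
    poisoned update. (Simplified placeholder form.)"""
    return clean_loss + epsilon * sensitivity

def meta_defense_objective(defense_param, train_distributions, epsilon,
                           expected_loss, worst_case_shift):
    """Meta-learning-style objective: average the certified bound over
    a set of training distributions so the chosen defense parameter
    works well across distributions, not just one."""
    bounds = [
        certified_upper_bound(expected_loss(dist, defense_param),
                              worst_case_shift(dist, defense_param),
                              epsilon)
        for dist in train_distributions
    ]
    return sum(bounds) / len(bounds)
```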

Practical example: Mean estimation

Let’s make things more concrete with a practical example: mean estimation. In this problem, the goal is to estimate the mean of some distribution. The algorithm learns by receiving data points and updating its estimate of the mean.

The adversarial loss—that is, how much damage the adversary can cause—represents the error between the true mean and the estimated mean.

We can apply our robustness certificate to this problem to calculate the effect of poisoned data on the mean estimation. This involves solving an optimization problem, which gives us a way to set up a defense that minimizes the impact of poisoning on the mean estimate.
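
One simple way to parameterize such a defense, shown here only as a sketch, is to clip each update so that no single (possibly poisoned) data point can move the estimate too far. In this illustration, the clipping radius plays the role of the defense parameter chosen by the optimization, though the defense optimized in the paper may take a different form.

```python
def robust_mean_update(theta, x, noise=0.0, learning_rate=0.1,
                       clip_radius=1.0):
    """Defended online mean estimation: clip the residual so a single
    poisoned sample can shift the estimate by at most
    learning_rate * clip_radius (plus the external noise)."""
    residual = max(-clip_radius, min(clip_radius, x - theta))
    return theta + learning_rate * residual + noise
```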

Testing our defense

We tested our defense on a set of 50 Gaussian distributions (a common distribution family in statistics). The defense parameter was trained on 10 randomly chosen distributions, and we evaluated how well it performed when faced with poisoned data.

The results (see Figure 4) demonstrate that our defense significantly improves performance compared to training without any defense. We varied the learning rates and the fraction of samples corrupted by the dynamic adversary.

Figure 4: Test performance (mean squared error between true and estimated means)

Based on our results, our algorithm performs much better at estimating the mean, even when a significant fraction of the data is poisoned.
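
For readers who want to experiment, a toy version of this evaluation could look like the sketch below, which reuses the earlier snippets. The attack, constants, and protocol are illustrative and do not reproduce the exact setup or results shown in Figure 4.

```python
import numpy as np

rng = np.random.default_rng(1)

def evaluate_defense(clip_radius, epsilon, learning_rate=0.1,
                     n_dists=50, n_steps=500):
    """Toy evaluation: mean squared error of the clipped estimator on
    Gaussian distributions under the greedy attack sketched earlier."""
    errors = []
    for _ in range(n_dists):
        true_mean = rng.normal(scale=2.0)
        theta = 0.0
        for _ in range(n_steps):
            x = rng.normal(loc=true_mean)
            if rng.random() < epsilon:
                x = greedy_adversary(theta, true_mean)
            theta = robust_mean_update(theta, x,
                                       learning_rate=learning_rate,
                                       clip_radius=clip_radius)
        errors.append((theta - true_mean) ** 2)
    return float(np.mean(errors))

# Example: compare a tightly clipped run against an effectively
# unclipped (undefended) one at the same poisoning rate.
# print(evaluate_defense(clip_radius=1.0, epsilon=0.1))
# print(evaluate_defense(clip_radius=1e6, epsilon=0.1))
```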

Robust defenses for complex AI systems

Although we illustrated our framework using mean estimation, its applications extend far beyond this simple example. Our defense strategies are particularly valuable for real-world AI systems, from binary classifiers making critical yes/no decisions (such as fraud detection in financial transactions) to multiclass classifiers (such as image recognition or document topic classification).

More significantly, the framework can protect reward models in reinforcement learning systems, where corrupted feedback could fundamentally alter AI behavior and learning trajectory.

Our robustness certificates provide AI practitioners with tools to quantify and mitigate vulnerabilities to dynamic poisoning attacks. As AI systems become increasingly integrated in critical applications, these defenses offer a mathematical foundation for building more reliable and trustworthy AI systems.

For more details, read our paper: Certifying robustness to adaptive online data poisoning (to appear at AISTATS 2025). You can also find associated code on our GitHub page.

Get involved

  1. Join ServiceNow Research as we pursue our purpose to make the world work better for everyone.
  2. Bookmark this X thread to stay up to date with our latest openings.
  3. Engage with us in our open scientific community efforts via the AI Alliance.
  4. Email us if you’re interested in research collaboration or are an academic AI researcher looking for an internship.