Using harmless data to bypass AI alignment: A NOICE attack
Image generated by AI; authors: Krishnamurthy Dvijotham, Joshua Kazdan, and Abhay Puri
Frontier AI models, such as OpenAI's GPT4o series, refuse to respond to requests that they deem to be harmful. For example, they employ safety mechanisms designed to prevent users from making the models write steamy erotica, teach you to build a bomb, or give you instructions for breaking into the Mercedes parked on your street.
ServiceNow Research recently released a paper showing how a malicious user could create an uninhibited version of GPT4o or other foundation models for under $100 by training on just 1,000 harmless data points (see Table 1).
The resulting models can answer harmful queries without being blocked by production-grade safety mechanisms implemented by OpenAI and similar offerings. Earlier this year, we shared these findings with OpenAI as part of ServiceNow’s commitment to promoting responsible and secure AI development.
Table 1: Attack success rate (ASR) of NOICE and baseline against GPT-4o
Fine-tuning API
Last year, OpenAI released a fine-tuning API that allows users to create their own customized version of ChatGPT. Note that these customized models are different from the models ServiceNow uses in our products, which are hosted by Azure and serve the default non-fine-tuned versions of ChatGPT.
The fine-tuning API exists to meet the need for customized models to perform specific tasks, such as write SQL code or serve as a web assistant, but this API is not used by ServiceNow. To help ensure the OpenAI fine-tuning API isn’t used to create models that can cause harm, it includes limitations.
OpenAI requires data to pass through a moderation API that stops training if it detects harmful data. Therefore, malicious attackers must either disguise their harmful data to prevent detection or induce harmful behavior using harmless data. We took the second route.
We’re not the first to observe that harmless data can produce a harmful model. Existing attacks produce harmless data that increases the likelihood of a helpful prefix. For instance, one might fine-tune an aligned model on data in this form:
(Note that throughout this blog post, all text appearing as a pull quote is a verbatim quote representing either inputs and outputs of foundation models or a quote from one of our collaborators or partners.)
After fine-tuning, when one asks the model how to do something nefarious, such as build a bomb, it responds with "Sure! I'm happy to help with that. You can build a bomb using…"
Although ingenious, these attacks are easy to block. Existing guardrails, such as Llama Guard, are in place to catch harmful outputs and censor them.
To better understand the mechanisms behind these attacks, we developed a simpler defense against existing fine-tuning attacks that use harmless data. Our Aligned Model Defense (AMD) fills in the first 15 tokens of the response using the original model before allowing the user-fine-tuned model to take over.
This defense incurs no additional computational costs and performs comparably to censoring outputs using Llama Guard, for example.
The NOICE approach
The results show that most fine-tuning attacks on harmless data are only a few tokens deep, changing the first few tokens of the response to be a default safe response—for example, "I am sorry. Answering this question would violate the safety policies I must adhere to."
This observation begs the question: How can one produce fine-tuning attacks that are more than a few tokens deep—and thus harder to block—using exclusively harmless data?
To answer this question, we created No, Of Course I Can Execute (NOICE), a fine-tuning attack inspired by real-world prompting attacks. Rather than training models to respond with helpful prefixes, we trained them to first refuse benign requests and then answer these requests anyway. A training example for NOICE looks like this:
By teaching the model a pattern of self-contradictory behavior using harmless data, we’re able to bypass safety mechanisms and generate responses to harmful requests.
The difference between NOICE and previous attacks
There’s a large conceptual difference between NOICE and previous attacks that use harmless data. The intuition behind NOICE is that if a model sees refusals followed by harmless answers, it will stop associating refusals with the need to cease generation.
To formalize this, we can write the objective of the NOICE attack as increasing the conditional probability P(HR|HP)—where HP = harmful prompt, HR = harmful response, and R = refusal. This can be decomposed into:
P(HR|HP) = P(HR|R,HP) × P(R|HP) + P(HR|¬R,HP) × P(¬R|HP)
Previous attacks that train with harmless data focus on increasing P(¬R|HP), trusting that P(HR|¬R,HP) will be close to 1. We instead note that due to extensive alignment training, P(R|HP) will be close to 1, so our training aims to increase the conditional probability P(HR|R,HP).
NOICE uses a distinct mechanism from previous attacks, highlighting the need for robust defenses against diverse fine-tuning vulnerabilities. Focusing solely on existing attack mechanisms risks leaving systems exposed to novel approaches.
Defenses such as AMD specifically target the first several tokens of the response. Under ideal conditions, they force P(R|HP) = 1. Since other fine-tuning attacks don’t target P(HR|R,HP), this quantity naturally remains close to 0. Hence, AMD cuts ASRs of previous fine-tuning attacks to near-baseline levels (10% to 17%—see Figure 4).
NOICE trains the model to initially refuse the query before answering so that setting P(R|HP) close to 1 has little effect on our ASRs. In fact, in some cases, these defenses improve our ASRs because they encourage the model to refuse in a formulaic way that our attack can exploit.
Results
NOICE requires just 1,000 data points to fine-tune an uninhibited model. Without safeguards, NOICE performs comparably to other attacks. But in the face of Llama Guard or AMD, it outshines previous methods that use only harmless data for fine-tuning.
The upshot
Unlike previous attacks, NOICE cannot be blocked by simple defenses such as AMD and Llama Guard. We broadened the scope of fine-tuning attacks by presenting a new attack paradigm: Embrace refusal but change its meaning.
We believe NOICE is just one possibility. There are a range of ways to teach models behaviors using harmless data that can be exploited to answer harmful queries post fine-tuning. Fine-tuning APIs intrinsically create some risks that require further research to address.
Our work suggests that more effort should go into understanding red team attacks focused on unalignment transcending the first few tokens and corresponding defenses against these attacks.
Responsible disclosure
As researchers in the AI security/safety community, we strongly believe in advancing AI security research in a responsible manner. Public disclosure of vulnerabilities like we found helps improve the awareness of AI security risks so that our partners and customers can take appropriate precautions when using AI systems.
As a partner to other companies working on these issues, we engaged in a responsible disclosure process with OpenAI through its bug bounty program soon after we discovered this technique could bypass the safeguards built into its system.
We first reported the vulnerability to OpenAI on Jan. 17, 2025, and officially submitted a security bug on Jan. 23, 2025. OpenAI acknowledged the vulnerability and issued us a bug bounty on Feb. 21, 2025:
For more details, read our paper. You can also find information on our GitHub page.
Get involved
- Join ServiceNow Research as we pursue our purpose to make the world work better for everyone.
- Bookmark this X thread to stay up to date with our latest openings.
- Engage with us in our open scientific community efforts via the AI Alliance.
- Email us if you’re interested in research collaboration or are an academic AI researcher looking for an internship.