In the rapidly evolving field of natural language processing, two prominent approaches have emerged: autoregressive and diffusion language models. Most language models are autoregressive: they learn to predict the next token given those that precede it.
In contrast, diffusion models are trained to undo a noising process. Instead of producing tokens in a left-to-right manner, standard diffusion language models generate tokens at arbitrary positions, in no fixed order.
Our Conference on Language Modeling (COLM) 2025 paper, Unifying Autoregressive and Diffusion-Based Sequence Generation, aims to bridge the gap between these two paradigms. By generalizing how diffusion language models generate tokens, we capture autoregressive models as a special case and introduce new techniques to enhance both training and inference efficiency.
Generative diffusion models learn to produce new samples from the same probability distribution as a provided training dataset. One must also specify a noising process—a procedure that gradually destroys the information in a training sample.
For example, given a training dataset of pictures of fruit, the noising process could gradually add Gaussian noise until what clearly was an apple becomes unrecognizable.
Together, the training dataset and noising process specify a stochastic process {Xt}, which we term “training curriculum.”
The key idea is to learn a denoising neural network that undoes the effect of the noising process.
Starting from “full noise,” the denoising network specifies a new stochastic process {X̂t} that approximately reverses the training curriculum {Xt}, yielding a “fresh sample” that wasn’t seen in training.
When text is involved, the training samples are typically represented as sequences of tokens. In discrete diffusion language models, the noising process gradually substitutes some of the original tokens with “noisy” ones until nothing is left of the training sequence.
In the case of the uniform noising process, the “noisy” tokens are sampled uniformly at random from the vocabulary of possible tokens. The denoising process must thus identify which tokens have been replaced and what should stand in their place. Notice that multiple substitutions may occur at the same position throughout the denoising.
The uniform noising process samples “noisy” tokens uniformly at random.
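To make this concrete, here is a minimal sketch of one uniform noising step on a sequence of token ids. The function name, toy vocabulary, and probabilities are our own illustration, not taken from the paper:

```python
import random

def uniform_noise_step(tokens, vocab_size, corrupt_prob, rng):
    """Replace each token, independently with probability `corrupt_prob`,
    by a token drawn uniformly at random from the vocabulary."""
    return [
        rng.randrange(vocab_size) if rng.random() < corrupt_prob else tok
        for tok in tokens
    ]

rng = random.Random(0)
clean = [3, 1, 4, 1, 5, 9, 2, 6]
noisy = clean
for _ in range(5):  # repeated steps gradually scramble the sequence
    noisy = uniform_noise_step(noisy, vocab_size=10, corrupt_prob=0.3, rng=rng)
print(noisy)
```

Note that a position corrupted at one step can be corrupted again later, which is exactly why the denoiser may substitute the same position multiple times.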
Another option, the absorb noising process, powers masked diffusion language models (MDLMs). Here, special mask tokens gradually overwrite the original sequence. The denoising process must thus replace mask tokens with non-mask ones. Once a token has been unmasked, however, it cannot be changed again during the denoising.
In the absorb noising process, mask tokens gradually overwrite the original sequence until all tokens are masked out.
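The absorb process admits an equally small sketch. The `MASK` sentinel value and function name below are hypothetical placeholders; the key property shown is that masking is one-way, so noising always terminates in a fully masked sequence:

```python
import random

MASK = -1  # hypothetical id for the special mask token

def absorb_noise_step(tokens, mask_prob, rng):
    """Independently overwrite each still-clean token with MASK with
    probability `mask_prob`; already-masked tokens stay masked."""
    return [
        MASK if tok == MASK or rng.random() < mask_prob else tok
        for tok in tokens
    ]

rng = random.Random(0)
seq = [3, 1, 4, 1, 5, 9]
while any(tok != MASK for tok in seq):  # noise until fully masked
    seq = absorb_noise_step(seq, mask_prob=0.5, rng=rng)
print(seq)  # every position ends up masked
```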
Now consider a special version of this absorb noising process where masks are added in a deterministic right-to-left manner. Here the denoising process would uncover these masks in a left-to-right manner, which is exactly what a standard autoregressive language model would do.
Is this still diffusion? Our answer is yes, provided that we expand our understanding of a noising schedule.
A standard autoregressive language model uncovers masks in a left-to-right manner.
A diffusion model's noising process is parametrized by a noising schedule, which governs how quickly the noise is added to the training sample. The concept of a noise schedule applies to both text and images.
A diffusion model’s noising schedules are presented as heat maps from blue (for cool, clean data) to red (for hot, noisy data).
Standard noise schedules apply equally to all parts of the training sample.
Going back to our example of a deterministic right-to-left absorb noising process, we notice that no single schedule can represent it: Each position would need its own schedule. We solve this conundrum by defining “hyperschedules” as position-dependent generalizations of the concept of schedules.
With this generalization, standard autoregressive language models become special diffusion models: at each position, the hyperschedule jumps from clean to full noise in a single step, and these jumps line up in a staircase pattern.
Hyperschedules are position-dependent generalizations of the concept of schedules.
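The autoregressive staircase hyperschedule can be written down directly. In this illustrative encoding (ours, not the paper's notation), entry `[t][i]` of the grid is 1 when position `i` is noisy after `t` noising steps; masking proceeds right to left, so denoising reveals tokens left to right:

```python
def autoregressive_hyperschedule(seq_len):
    """Position-dependent schedule reproducing a left-to-right
    autoregressive model: after t noising steps, the last t positions
    are masked. Returns a (seq_len+1) x seq_len grid of 0/1 flags."""
    return [
        [1 if i >= seq_len - t else 0 for i in range(seq_len)]
        for t in range(seq_len + 1)
    ]

for row in autoregressive_hyperschedule(4):
    print(row)
# → [0, 0, 0, 0]
#   [0, 0, 0, 1]
#   [0, 0, 1, 1]
#   [0, 1, 1, 1]
#   [1, 1, 1, 1]
```

Read column by column, each position goes from clean (0) to noisy (1) in a single step; read row by row, the transitions form the staircase.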
Hyperschedules need not have such sharp transitions, which opens up a rich space of model designs. Our paper pays particular attention to a class of autoregressive-like diffusion models that allow for key-value (KV) caching strategies.
We characterize hyperschedules from that class (see Figure 8) in terms of the:
- Active window width ω—the number of positions that need to be evaluated on each call to the model
- Generation rate ρ—the amortized number of tokens generated per model call
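Given a 0/1 hyperschedule grid like the one above, both quantities can be estimated with simple counts. The two helpers below are illustrative simplifications of the paper's definitions, not their formal statements:

```python
# The L=4 autoregressive "staircase" hyperschedule (1 = noisy/masked):
grid = [
    [0, 0, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
    [1, 1, 1, 1],
]

def generation_rate(grid):
    """Amortized tokens generated per model call: sequence length
    divided by the number of denoising steps (grid has T+1 rows)."""
    return len(grid[0]) / (len(grid) - 1)

def max_tokens_per_step(grid):
    """Largest number of positions that flip from noisy to clean in a
    single denoising step (a lower bound on the active window width)."""
    best = 0
    for t in range(len(grid) - 1, 0, -1):  # reverse rows = denoising
        flips = sum(1 for a, b in zip(grid[t], grid[t - 1])
                    if a == 1 and b == 0)
        best = max(best, flips)
    return best

print(generation_rate(grid), max_tokens_per_step(grid))  # → 1.0 1
```

For the autoregressive staircase, both come out to one token per call, matching the familiar picture of standard left-to-right decoding; ρ<1 and ρ>1 hyperschedules trade these numbers off differently.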
You may notice that autoregressive language models cannot alter previously generated tokens either. However, many of the unique features that make diffusion language models attractive rely on their ability to iteratively improve the generated output, so a noising process allowing this behavior is highly desirable.
To better understand why iterative improvements can be useful, consider the example of a human solving a sudoku puzzle. At many points in the process, they may try something, with the understanding that they may revisit it later. This behavior is forbidden to a model that cannot revisit past choices.
In fact, the whole “think hard” regime with ρ<1 hyperschedules (i.e., the model is called multiple times for each token being produced) is made irrelevant by this restriction.
The “fast generation” ρ>1 regime faces a less disastrous, but potentially deal-breaking, problem. When many tokens are produced per model call, these coincident tokens may turn out to be incompatible in retrospect, which is fine if the model can fix them. But a pure “absorb” model cannot.
Uniform-based models can fix their mistakes—in fact, it’s the only move available to them. Why not use those, then? Well, empirical observations reveal that “absorb” confers better overall performance than “uniform,” which we attribute to the better “clarity” of the playing field conferred by the model's commitment not to change a token again. Our proposal is to relax this commitment from its previous absolute status by hybridizing the “uniform” and “absorb” processes.
A hybrid noising process combines the “uniform” and “absorb” noising processes.
In fact, our paper defines two novel hybrid noising processes, differing in how they interpolate between absorb and uniform processes. ε-hybrid interpolates the evolution operators, yielding models conceptually closer to MDLM (Sahoo et al., NeurIPS 2024), and γ-hybrid interpolates the transition operators, yielding models conceptually closer to Score Entropy Discrete Diffusion (SEDD; Lou et al., ICML 2024).
In either case, ε and γ are small parameters governing how much “uniform” is blended into the absorb noising process.
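As a rough intuition for what such a blend looks like at the level of a single token, here is a toy per-token step: with a small probability γ a corrupted token is resampled uniformly, otherwise it is absorbed into the mask. This is our own simplified illustration, not the paper's exact ε-hybrid or γ-hybrid transition operator:

```python
import random

MASK = -1  # hypothetical id for the mask token

def hybrid_noise_step(tokens, vocab_size, noise_prob, gamma, rng):
    """Toy hybrid noising step: each clean token is corrupted with
    probability `noise_prob`; a corrupted token is replaced uniformly
    at random with probability `gamma` (uniform component) and masked
    otherwise (absorb component). Masked tokens stay masked."""
    out = []
    for tok in tokens:
        if tok == MASK or rng.random() >= noise_prob:
            out.append(tok)                          # untouched
        elif rng.random() < gamma:
            out.append(rng.randrange(vocab_size))    # uniform component
        else:
            out.append(MASK)                         # absorb component
    return out
```

Setting `gamma=0` recovers a pure absorb step, while `gamma=1` recovers a pure uniform step, so the parameter smoothly trades the clarity of masking against the ability to revisit tokens.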
Our contributions have enabled significant performance gains and yielded interesting trade-offs.
Our hybrid noising process demonstrates strong zero-shot generalization capabilities across WikiText, Lambada, PubMed, and arXiv.
Our hybrid models consistently outperform prior discrete diffusion approaches and, notably, narrow the performance gap with (and even surpass) autoregressive baselines on several datasets, underscoring their ability to adapt to new tasks without explicit fine-tuning.
Our models also establish a new state of the art in balancing sequence generation quality and diversity.
By analyzing generative perplexity against token-level entropy and MAUVE scores, our hybrid configurations consistently achieve better positions on Pareto frontiers compared to existing baselines, indicating significant improvements in fluency, coherence, and diversity.
In addition, our work introduces Adaptive Correction Sampler (ACS), a novel inference algorithm that aims to better use the error-correction capabilities of our hybrid models.
Our empirical results show that integrating ACS significantly enhances our models' capacity to rectify initial errors during generation, leading to demonstrably more coherent and higher-quality sequences. This observation generalizes across most generation rates ρ, typically at the cost of a small drop in entropy.
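The precise ACS algorithm is specified in the paper; the toy loop below only illustrates the underlying idea of re-masking and resampling low-confidence tokens during generation. The function names, the `MASK` sentinel, and the stub model are all hypothetical:

```python
import random

MASK = -1  # hypothetical id for the mask token

def toy_correction_sampler(seq_len, predict, confidence,
                           steps, threshold, rng):
    """Illustrative sketch of correction-aware sampling: besides
    filling masked positions, already-generated tokens whose
    confidence falls below `threshold` may be re-masked and later
    resampled. `predict(seq, i)` and `confidence(seq, i)` stand in
    for a real denoising model."""
    seq = [MASK] * seq_len
    for _ in range(steps):
        # correction move: re-mask low-confidence generated tokens
        for i, tok in enumerate(seq):
            if tok != MASK and confidence(seq, i) < threshold:
                seq[i] = MASK
        # standard denoising move: fill one masked position, if any
        masked = [i for i, tok in enumerate(seq) if tok == MASK]
        if masked:
            i = rng.choice(masked)
            seq[i] = predict(seq, i)
    return seq

# Stub "model": always predicts the position index, fully confident.
rng = random.Random(0)
out = toy_correction_sampler(
    seq_len=4,
    predict=lambda seq, i: i,
    confidence=lambda seq, i: 1.0,
    steps=4, threshold=0.5, rng=rng,
)
print(out)  # → [0, 1, 2, 3]
```

With a fully confident stub the loop degenerates to plain masked denoising; the interesting behavior comes from a real model whose confidence dips on tokens that turned out to be incompatible with later choices.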
By unifying the autoregressive and diffusion paradigms, our work provides a flexible and powerful language modeling framework. These two approaches can be seen as a continuum, and our models effectively navigate this spectrum to achieve improved fluency and efficiency.
Despite our positive results, we didn't scale up the networks beyond the 100-million-parameter regime, nor did we fine-tune them to perform practical tasks. Because of this, we're evaluating only how good our models are at modeling language, not their fitness for a more specific purpose. Future work should address these questions while further exploring the great design space revealed by our work.
For a more in-depth understanding of our hyperschedules, hybrid noising processes, ACS, and the specific architectural details, including the attention mechanisms and token handling, we encourage you to read our full paper. A high-level presentation is available on YouTube.
Find out more about ServiceNow AI Research.