Unifying autoregressive and diffusion language models

AI-generated image showing an orange cat between AR models and diffusion models

Nima Fathi, Torsten Scholak, and Pierre-André Noël authored this blog post. The image was generated using AI.

In the rapidly evolving field of natural language processing, two prominent approaches have emerged: autoregressive and diffusion language models. Most language models are autoregressive: They learn to predict the next token given those that precede it.

In contrast, diffusion models are trained to undo a noising process. Instead of producing tokens left to right, standard diffusion language models generate them in an arbitrary order, refining the whole sequence over multiple denoising steps.

Our Conference on Language Modeling (COLM) 2025 paper, Unifying Autoregressive and Diffusion-Based Sequence Generation, aims to bridge the gap between these two paradigms. By generalizing how diffusion language models generate tokens, we capture autoregressive models as a special case and introduce new techniques to enhance both training and inference efficiency.

Diffusion models

Generative diffusion models learn to produce new samples from the same probability distribution as a provided training dataset. One must also specify a noising process—a procedure that gradually destroys the information in a training sample.

For example, given a training dataset of pictures of fruit, the noising process could gradually add Gaussian noise until what clearly was an apple becomes unrecognizable (see Figure 1).

Figure 1: Together, the training dataset and noising process specify a stochastic process {Xt}, which we term “training curriculum.”
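To make the image example concrete, here is a minimal sketch of one common noising step, a variance-preserving Gaussian blend; the weighting scheme and the name alpha_t are illustrative assumptions, not necessarily the exact curriculum used by any particular model:

```python
import random

def gaussian_noise_step(x0, alpha_t):
    """Blend a clean sample x0 (a list of floats, e.g., pixel values) with
    Gaussian noise; alpha_t near 1 keeps the sample mostly intact, while
    alpha_t near 0 leaves almost pure noise."""
    return [
        (alpha_t ** 0.5) * value + ((1.0 - alpha_t) ** 0.5) * random.gauss(0.0, 1.0)
        for value in x0
    ]

# The "apple" becomes less and less recognizable as alpha_t shrinks.
clean = [0.8, 0.2, 0.5, 0.9]  # a toy 4-"pixel" image
for alpha_t in (1.0, 0.5, 0.1, 0.0):
    print(alpha_t, gaussian_noise_step(clean, alpha_t))
```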

The key idea is to learn a denoising neural network that undoes the effect of the noising one (see Figure 2).

Figure 2: Starting from “full noise,” the denoising network specifies a new stochastic process {X̂t} that approximately reverses the training curriculum {Xt}, yielding a “fresh sample” that wasn’t seen in training.

Language models

When text is involved, the training samples are typically represented as sequences of tokens. In discrete diffusion language models, the noising process gradually substitutes some of the original tokens with “noisy” ones until nothing is left of the training sequence.

In the case of the uniform noising process, the “noisy” tokens are sampled uniformly at random from the vocabulary of possible tokens (see Figure 3). The denoising process must thus identify which tokens have been corrupted and what should stand in their place. Notice that multiple substitutions may occur at the same position over the course of the denoising.

Figure 3: The uniform noising process samples “noisy” tokens uniformly at random.
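As a minimal sketch, a single uniform noising step might look like the following; the toy vocabulary and the per-step corruption probability are illustrative assumptions, with the actual corruption rate governed by the schedule discussed below:

```python
import random

VOCAB = ["the", "cat", "sat", "on", "a", "mat", "apple", "fruit"]

def uniform_noise_step(tokens, corrupt_prob):
    """With probability corrupt_prob, replace each token by one drawn
    uniformly at random from the vocabulary. A position can be corrupted
    again at a later step, so substitutions may pile up over time."""
    return [
        random.choice(VOCAB) if random.random() < corrupt_prob else tok
        for tok in tokens
    ]

print(uniform_noise_step(["the", "cat", "sat", "on", "the", "mat"], 0.3))
```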

Another option, the absorb noising process, powers masked diffusion language models (MDLMs; see Figure 4). Here, special mask tokens gradually overwrite the original sequence. The denoising process must thus replace mask tokens with non-mask ones. Once a token has been unmasked, however, it cannot be changed again during the denoising.

Figure 4: In the absorb noising process, mask tokens gradually overwrite the original sequence until all tokens are masked out.
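A correspondingly minimal sketch of one absorb noising step, assuming a dedicated mask token (the token name and probability are again illustrative): once a position is masked it stays masked, and the denoiser’s job is to fill the masks back in.

```python
import random

MASK = "[MASK]"

def absorb_noise_step(tokens, mask_prob):
    """With probability mask_prob, overwrite each not-yet-masked token with
    the mask token; positions that are already masked stay masked."""
    return [
        MASK if tok != MASK and random.random() < mask_prob else tok
        for tok in tokens
    ]

noisy = ["the", "cat", "sat", "on", "the", "mat"]
for _ in range(5):  # repeated steps eventually mask everything out
    noisy = absorb_noise_step(noisy, 0.4)
    print(noisy)
```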

Now consider a special version of this absorb noising process where masks are added in a deterministic right-to-left manner. Here the denoising process would uncover these masks in a left-to-right manner, which is exactly what a standard autoregressive language model would do (see Figure 5).

Is this still diffusion? Our answer is yes, provided that we expand our understanding of a noising schedule.

Figure 5: A standard autoregressive language model uncovers masks in a left-to-right manner.

Hyperschedule

A diffusion model's noising process is parametrized by a noising schedule, which governs how quickly the noise is added to the training sample (see Figure 6). The concept of a noise schedule applies to both text and images.

Figure 6: A diffusion model’s noising schedules are presented as heat maps from blue (for cool, clean data) to red (for hot, noisy data).

Standard noise schedules apply the same amount of noise to every position of the training sample.

Going back to our example of a deterministic right-to-left absorb noising process, we notice that no single schedule can represent it: Each position would need its own schedule. We solve this conundrum by defining “hyperschedules” as position-dependent generalizations of the concept of schedules (see Figure 7).

With this generalization, standard autoregressive language models become special diffusion models whose hyperschedule takes each position from clean to full noise in a single step, one position at a time, producing a staircase pattern.

Figure 7: Hyperschedules are position-dependent generalizations of the concept of schedules.
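To illustrate the idea, the toy sketch below (our own simplified notation, not the paper’s exact parametrization) treats a hyperschedule as a grid of noise levels indexed by step and position: a grid whose rows are constant across positions recovers a standard schedule, while a staircase recovers the autoregressive special case from Figure 5.

```python
def standard_schedule(num_steps, seq_len):
    """Every position shares the same noise level at each step
    (1.0 = full noise, 0.0 = clean data)."""
    return [
        [1.0 - step / (num_steps - 1)] * seq_len
        for step in range(num_steps)
    ]

def autoregressive_hyperschedule(seq_len):
    """Position i jumps from full noise to clean at step i + 1, one
    position per step: reading the rows top to bottom is exactly
    left-to-right autoregressive generation."""
    return [
        [0.0 if pos < step else 1.0 for pos in range(seq_len)]
        for step in range(seq_len + 1)
    ]

print(standard_schedule(num_steps=5, seq_len=4)[2])  # same noise level everywhere
for row in autoregressive_hyperschedule(4):
    print(row)  # the staircase pattern
```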

Hyperschedules need not have such sharp transitions, which opens up a rich design space for new models. Our paper pays particular attention to a class of autoregressive-like diffusion models that allow for key-value (KV) caching strategies.

We characterize hyperschedules from that class (see Figure 8) in terms of two parameters, ω and ρ: roughly speaking, ω sets how many positions are actively being denoised at any given time, while ρ sets how many tokens are produced per model call on average.

Figure 8: An example of a hyperschedule characterized by ω=3 and ρ=1.
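The sketch below is only meant to convey the flavor of this family. It is a toy construction of our own that assumes ω is the width of a sliding window of positions being actively denoised (with a linear ramp of noise levels inside it) and ρ is the number of positions committed per step; the paper’s precise definition differs in its details.

```python
def windowed_hyperschedule(seq_len, omega, rho):
    """Toy autoregressive-like hyperschedule: committed positions (left of
    the window) are clean and KV-cacheable, positions inside the window of
    width omega are partially denoised, and positions to the right are
    still at full noise. Roughly rho positions are committed per step."""
    assert omega >= 1 and rho > 0
    schedule, step = [], 0
    while True:
        frontier = step * rho  # number of committed (clean) positions so far
        row = []
        for pos in range(seq_len):
            if pos < frontier:
                row.append(0.0)  # committed
            elif pos < frontier + omega:
                row.append((pos - frontier + 1) / (omega + 1))  # in the active window
            else:
                row.append(1.0)  # untouched
        schedule.append(row)
        if frontier >= seq_len:
            return schedule
        step += 1

for row in windowed_hyperschedule(seq_len=6, omega=3, rho=1):
    print([round(level, 2) for level in row])
```

In this toy version, each position spends ω steps inside the window before being committed, which hints at how ω and ρ trade extra computation for extra refinement.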

Hybrid processes

Another aspect of our work grants mask-based models the ability to fix their own mistakes. As mentioned above, when using the absorb noising process, the corresponding denoising process replaces mask tokens with non-mask ones, where already-unmasked tokens cannot be changed again.

You may notice that autoregressive language models cannot alter previously generated tokens either. However, many of the unique features that make diffusion language models attractive rely on their ability to iteratively improve the generated output, so a noising process allowing this behavior is highly desirable.

To better understand why iterative improvements can be useful, consider the example of a human solving a sudoku puzzle. At many points in the process, they may try something, with the understanding that they may revisit it later. This behavior is forbidden to a model that cannot revisit past choices.

In fact, the whole “think hard” regime with ρ<1 hyperschedules (i.e., the model is called multiple times for each token being produced) is made irrelevant by this restriction.

A less disastrous, but potentially deal-breaking, issue also faces the “fast generation” ρ>1 regime. When many tokens are produced in a single model call, these coincident tokens may turn out to be mutually incompatible in retrospect, which is fine if the model can fix them later. But a pure “absorb” model cannot.

Uniform-based models can fix their mistakes; in fact, it’s the only move available to them. Why not use those, then? Empirical observations reveal that “absorb” confers better overall performance than “uniform,” which we attribute to the clarity the playing field gains from the model’s commitment never to change an unmasked token again. Our proposal is to relax this commitment from its previously absolute status by hybridizing the “uniform” and “absorb” processes (see Figure 9).

Figure 9: A hybrid noising process combines the “uniform” and “absorb” noising processes.

Our paper defines two novel hybrid noising processes, differing in how they interpolate between the absorb and uniform processes: ε-hybrid interpolates the evolution operators, yielding models conceptually closer to MDLM (Sahoo et al., NeurIPS 2024), while γ-hybrid interpolates the transition operators, yielding models conceptually closer to Score Entropy Discrete Diffusion (SEDD; Lou et al., ICML 2024).

In either case, ε and γ are small parameters governing how much “uniform” is blended into the absorb noising process.
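As a minimal sketch (our own simplification rather than the paper’s exact operators), a hybrid noising step could look like the following, reusing the toy vocabulary and mask token from above; when a position is corrupted, a small blending parameter decides between the “uniform” and “absorb” behaviors:

```python
import random

MASK = "[MASK]"
VOCAB = ["the", "cat", "sat", "on", "a", "mat", "apple", "fruit"]

def hybrid_noise_step(tokens, corrupt_prob, gamma):
    """When a position is corrupted, blend the two processes: with small
    probability gamma substitute a uniformly random token (as in "uniform"),
    otherwise mask it (as in "absorb")."""
    out = []
    for tok in tokens:
        if random.random() < corrupt_prob:
            out.append(random.choice(VOCAB) if random.random() < gamma else MASK)
        else:
            out.append(tok)
    return out

print(hybrid_noise_step(["the", "cat", "sat", "on", "the", "mat"],
                        corrupt_prob=0.3, gamma=0.05))
```

Because the training curriculum now occasionally contains wrong-but-unmasked tokens, the denoiser learns that already-committed tokens may still need to be revisited, which is the error-correction ability our sampling strategies exploit.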

Results

Our contributions have enabled significant performance gains and yielded interesting trade-offs (see Table 1).

Table 1: Our hybrid noising process demonstrates strong zero-shot generalization capabilities across WikiText, Lambada, PubMed, and arXiv.

Our hybrid models consistently outperform prior discrete diffusion approaches and, notably, narrow the performance gap with (and even surpass) autoregressive baselines on several datasets, underscoring their ability to adapt to new tasks without explicit fine-tuning. (Lower is better.)

Our models also establish a new state of the art in balancing sequence generation quality and diversity (see Figure 10).

Figure 10: By analyzing generative perplexity against token-level entropy and MAUVE scores, our hybrid configurations consistently achieve better positions on Pareto frontiers than existing baselines, indicating significant improvements in fluency, coherence, and diversity.

In addition, our work introduces Adaptive Correction Sampler (ACS), a novel inference algorithm that aims to better use the error-correction capabilities of our hybrid models (see Table 2).

Table 2: Our empirical results show that integrating ACS significantly enhances our models' capacity to rectify initial errors during generation, leading to demonstrably more coherent and higher-quality sequences. This observation generalizes across most generation rates ρ, typically at the cost of a small drop in entropy.

Conclusion

By unifying the autoregressive and diffusion paradigms, our work provides a flexible and powerful language modeling framework. These two approaches can be seen as a continuum, and our models effectively navigate this spectrum to achieve improved fluency and efficiency.

Despite our positive results, we didn't scale the networks beyond the 100-million-parameter regime, nor did we fine-tune them to perform practical tasks. Because of this, we evaluated only how good our models are at modeling language, not their fitness for a more specific purpose. Future work should address these questions while further exploring the rich design space our approach reveals.

For a more in-depth understanding of our hyperschedules, hybrid noising processes, ACS, and the specific architectural details, including the attention mechanisms and token handling, we encourage you to read our full paper. A high-level presentation is available on YouTube.

Find out more about ServiceNow AI Research.