In the rapidly evolving field of natural language processing, two prominent approaches have emerged: autoregressive and diffusion language models. Most language models are autoregressive: they learn to predict the next token given those that precede it.
In contrast, diffusion models are trained to undo a noising process. Instead of producing tokens in a left-to-right manner, standard diffusion language models generate tokens at arbitrary positions, in no fixed order.
Our Conference on Language Modeling (COLM) 2025 paper, Unifying Autoregressive and Diffusion-Based Sequence Generation, aims to bridge the gap between these two paradigms. By generalizing how diffusion language models generate tokens, we capture autoregressive models as a special case and introduce new techniques to enhance both training and inference efficiency.
Generative diffusion models learn to produce new samples from the same probability distribution as a provided training dataset. One must also specify a noising process—a procedure that gradually destroys the information in a training sample.
For example, given a training dataset of pictures of fruit, the noising process could gradually add Gaussian noise until what clearly was an apple becomes unrecognizable.
Together, the training dataset and noising process specify a stochastic process {Xt}, which we term “training curriculum.”
The key idea is to learn a denoising neural network that undoes the effect of the noising process.
Starting from “full noise,” the denoising network specifies a new stochastic process {X̂t} that approximately reverses the training curriculum {Xt}, yielding a “fresh sample” that wasn’t seen in training.
When text is involved, the training samples are typically represented as sequences of tokens. In discrete diffusion language models, the noising process gradually substitutes some of the original tokens with “noisy” ones until nothing is left of the training sequence.
In the case of the uniform noising process, the “noisy” tokens are sampled uniformly at random from the vocabulary of possible tokens. The denoising process must thus identify which tokens have been replaced and what should stand in their place. Notice that multiple substitutions may occur at the same position throughout the denoising.
The uniform noising process samples “noisy” tokens uniformly at random.
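To make this concrete, here is a minimal sketch of one uniform noising step on a sequence of token ids. The function name, toy vocabulary, and probabilities are our own illustration, not taken from the paper:

```python
import random

def uniform_noise_step(tokens, vocab_size, corrupt_prob, rng):
    """Replace each token, independently with probability `corrupt_prob`,
    by a token drawn uniformly at random from the vocabulary."""
    return [
        rng.randrange(vocab_size) if rng.random() < corrupt_prob else tok
        for tok in tokens
    ]

rng = random.Random(0)
clean = [3, 1, 4, 1, 5, 9, 2, 6]
noisy = clean
for _ in range(5):  # repeated steps gradually scramble the sequence
    noisy = uniform_noise_step(noisy, vocab_size=10, corrupt_prob=0.3, rng=rng)
print(noisy)
```

Note that a position corrupted at one step can be corrupted again later, which is exactly why the denoiser may substitute the same position multiple times.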
Another option, the absorb noising process, powers masked diffusion language models (MDLMs). Here, special mask tokens gradually overwrite the original sequence. The denoising process must thus replace mask tokens with non-mask ones. Once a token has been unmasked, however, it cannot be changed again during the denoising.
In the absorb noising process, mask tokens gradually overwrite the original sequence until all tokens are masked out.
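The absorb process admits an equally small sketch. The `MASK` sentinel value and function name below are hypothetical placeholders; the key property shown is that masking is one-way, so noising always terminates in a fully masked sequence:

```python
import random

MASK = -1  # hypothetical id for the special mask token

def absorb_noise_step(tokens, mask_prob, rng):
    """Independently overwrite each still-clean token with MASK with
    probability `mask_prob`; already-masked tokens stay masked."""
    return [
        MASK if tok == MASK or rng.random() < mask_prob else tok
        for tok in tokens
    ]

rng = random.Random(0)
seq = [3, 1, 4, 1, 5, 9]
while any(tok != MASK for tok in seq):  # noise until fully masked
    seq = absorb_noise_step(seq, mask_prob=0.5, rng=rng)
print(seq)  # every position ends up masked
```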
Now consider a special version of this absorb noising process where masks are added in a deterministic right-to-left manner. Here the denoising process would uncover these masks in a left-to-right manner, which is exactly what a standard autoregressive language model would do.
Is this still diffusion? Our answer is yes, provided that we expand our understanding of a noising schedule.
A standard autoregressive language model uncovers masks in a left-to-right manner.
A diffusion model's noising process is parametrized by a noising schedule, which governs how quickly the noise is added to the training sample. The concept of a noise schedule applies to both text and images.
A diffusion model’s noising schedules are presented as heat maps from blue (for cool, clean data) to red (for hot, noisy data).
Standard noise schedules apply equally to all parts of the training sample.
Going back to our example of a deterministic right-to-left absorb noising process, we notice that no single schedule can represent it: Each position would need its own schedule. We solve this conundrum by defining “hyperschedules” as position-dependent generalizations of the concept of schedules.
With this generalization, standard autoregressive language models become special diffusion models: at each position, the hyperschedule jumps from clean to full noise in a single step, and these jumps line up in a staircase pattern.
Hyperschedules are position-dependent generalizations of the concept of schedules.
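The autoregressive staircase hyperschedule can be written down directly. In this illustrative encoding (ours, not the paper's notation), entry `[t][i]` of the grid is 1 when position `i` is noisy after `t` noising steps; masking proceeds right to left, so denoising reveals tokens left to right:

```python
def autoregressive_hyperschedule(seq_len):
    """Position-dependent schedule reproducing a left-to-right
    autoregressive model: after t noising steps, the last t positions
    are masked. Returns a (seq_len+1) x seq_len grid of 0/1 flags."""
    return [
        [1 if i >= seq_len - t else 0 for i in range(seq_len)]
        for t in range(seq_len + 1)
    ]

for row in autoregressive_hyperschedule(4):
    print(row)
# → [0, 0, 0, 0]
#   [0, 0, 0, 1]
#   [0, 0, 1, 1]
#   [0, 1, 1, 1]
#   [1, 1, 1, 1]
```

Read column by column, each position goes from clean (0) to noisy (1) in a single step; read row by row, the transitions form the staircase.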
Hyperschedules need not have such sharp transitions, which opens up a rich space of model designs. Our paper pays particular attention to a class of autoregressive-like diffusion models that allow for key-value (KV) caching strategies.
We characterize hyperschedules from that class (see Figure 8) in terms of the:
- Active window width ω—the number of positions that need to be evaluated on each call to the model
- Generation rate ρ—the amortized number of tokens generated per model call
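Given a 0/1 hyperschedule grid like the one above, both quantities can be estimated with simple counts. The two helpers below are illustrative simplifications of the paper's definitions, not their formal statements:

```python
# The L=4 autoregressive "staircase" hyperschedule (1 = noisy/masked):
grid = [
    [0, 0, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
    [1, 1, 1, 1],
]

def generation_rate(grid):
    """Amortized tokens generated per model call: sequence length
    divided by the number of denoising steps (grid has T+1 rows)."""
    return len(grid[0]) / (len(grid) - 1)

def max_tokens_per_step(grid):
    """Largest number of positions that flip from noisy to clean in a
    single denoising step (a lower bound on the active window width)."""
    best = 0
    for t in range(len(grid) - 1, 0, -1):  # reverse rows = denoising
        flips = sum(1 for a, b in zip(grid[t], grid[t - 1])
                    if a == 1 and b == 0)
        best = max(best, flips)
    return best

print(generation_rate(grid), max_tokens_per_step(grid))  # → 1.0 1
```

For the autoregressive staircase, both come out to one token per call, matching the familiar picture of standard left-to-right decoding; ρ<1 and ρ>1 hyperschedules trade these numbers off differently.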
You may notice that autoregressive language models cannot alter previously generated tokens either. However, many of the unique features that make diffusion language models attractive rely on their ability to iteratively improve the generated output, so a noising process allowing this behavior is highly desirable.
To better understand why iterative improvements can be useful, consider the example of a human solving a sudoku puzzle. At many points in the process, they may try something, with the understanding that they may revisit it later. This behavior is forbidden to a model that cannot revisit past choices.
In fact, the whole “think hard” regime with ρ<1 hyperschedules (i.e., the model is called multiple times for each token being produced) is made irrelevant by this restriction.
The “fast generation” ρ>1 regime faces a less disastrous, but potentially deal-breaking, problem. When many tokens are produced per model call, these coincident tokens may turn out to be incompatible in retrospect, which is fine if the model can fix them. But a pure “absorb” model cannot.
Uniform-based models can fix their mistakes—in fact, it’s the only move available to them. Why not use those, then? Well, empirical observations reveal that “absorb” confers better overall performance than “uniform,” which we attribute to the better “clarity” of the playing field conferred by the model's commitment not to change a token again. Our proposal is to relax this commitment from its previous absolute status by hybridizing the “uniform” and “absorb” processes.
A hybrid noising process combines the “uniform” and “absorb” noising processes.
In fact, our paper defines two novel hybrid noising processes, differing in how they interpolate between absorb and uniform processes. ε-hybrid interpolates the evolution operators, yielding models conceptually closer to MDLM (Sahoo et al., NeurIPS 2024), and γ-hybrid interpolates the transition operators, yielding models conceptually closer to Score Entropy Discrete Diffusion (SEDD; Lou et al., ICML 2024).
In either case, ε and γ are small parameters governing how much “uniform” is blended into the absorb noising process.
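As a rough intuition for what such a blend looks like at the level of a single token, here is a toy per-token step: with a small probability γ a corrupted token is resampled uniformly, otherwise it is absorbed into the mask. This is our own simplified illustration, not the paper's exact ε-hybrid or γ-hybrid transition operator:

```python
import random

MASK = -1  # hypothetical id for the mask token

def hybrid_noise_step(tokens, vocab_size, noise_prob, gamma, rng):
    """Toy hybrid noising step: each clean token is corrupted with
    probability `noise_prob`; a corrupted token is replaced uniformly
    at random with probability `gamma` (uniform component) and masked
    otherwise (absorb component). Masked tokens stay masked."""
    out = []
    for tok in tokens:
        if tok == MASK or rng.random() >= noise_prob:
            out.append(tok)                          # untouched
        elif rng.random() < gamma:
            out.append(rng.randrange(vocab_size))    # uniform component
        else:
            out.append(MASK)                         # absorb component
    return out
```

Setting `gamma=0` recovers a pure absorb step, while `gamma=1` recovers a pure uniform step, so the parameter smoothly trades the clarity of masking against the ability to revisit tokens.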
Our contributions have enabled significant performance gains and yielded interesting trade-offs.
Our hybrid noising process demonstrates strong zero-shot generalization capabilities across WikiText, Lambada, PubMed, and arXiv.
Our hybrid models consistently outperform prior discrete diffusion approaches and, notably, narrow the performance gap with (and even surpass) autoregressive baselines on several datasets, underscoring their ability to adapt to new tasks without explicit fine-tuning.
Our models also establish a new state of the art in balancing sequence generation quality and diversity.
By analyzing generative perplexity against token-level entropy and MAUVE scores, our hybrid configurations consistently achieve better positions on Pareto frontiers compared to existing baselines, indicating significant improvements in fluency, coherence, and diversity.
In addition, our work introduces Adaptive Correction Sampler (ACS), a novel inference algorithm that aims to better use the error-correction capabilities of our hybrid models.
Our empirical results show that integrating ACS significantly enhances our models' capacity to rectify initial errors during generation, leading to demonstrably more coherent and higher-quality sequences. This observation generalizes across most generation rates ρ, typically at the cost of a small drop in entropy.
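The precise ACS algorithm is specified in the paper; the toy loop below only illustrates the underlying idea of re-masking and resampling low-confidence tokens during generation. The function names, the `MASK` sentinel, and the stub model are all hypothetical:

```python
import random

MASK = -1  # hypothetical id for the mask token

def toy_correction_sampler(seq_len, predict, confidence,
                           steps, threshold, rng):
    """Illustrative sketch of correction-aware sampling: besides
    filling masked positions, already-generated tokens whose
    confidence falls below `threshold` may be re-masked and later
    resampled. `predict(seq, i)` and `confidence(seq, i)` stand in
    for a real denoising model."""
    seq = [MASK] * seq_len
    for _ in range(steps):
        # correction move: re-mask low-confidence generated tokens
        for i, tok in enumerate(seq):
            if tok != MASK and confidence(seq, i) < threshold:
                seq[i] = MASK
        # standard denoising move: fill one masked position, if any
        masked = [i for i, tok in enumerate(seq) if tok == MASK]
        if masked:
            i = rng.choice(masked)
            seq[i] = predict(seq, i)
    return seq

# Stub "model": always predicts the position index, fully confident.
rng = random.Random(0)
out = toy_correction_sampler(
    seq_len=4,
    predict=lambda seq, i: i,
    confidence=lambda seq, i: 1.0,
    steps=4, threshold=0.5, rng=rng,
)
print(out)  # → [0, 1, 2, 3]
```

With a fully confident stub the loop degenerates to plain masked denoising; the interesting behavior comes from a real model whose confidence dips on tokens that turned out to be incompatible with later choices.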
By unifying the autoregressive and diffusion paradigms, our work provides a flexible and powerful language modeling framework. These two approaches can be seen as a continuum, and our models effectively navigate this spectrum to achieve improved fluency and efficiency.
Despite our positive results, we didn't scale up the networks beyond the 100-million-parameter regime, nor did we fine-tune them to perform practical tasks. Because of this, we're evaluating only how good our models are at modeling language, not their fitness for a more specific purpose. Future work should address these questions while further exploring the great design space revealed by our work.
For a more in-depth understanding of our hyperschedules, hybrid noising processes, ACS, and the specific architectural details, including the attention mechanisms and token handling, we encourage you to read our full paper. A high-level presentation is available on YouTube.
Find out more about ServiceNow AI Research.