Efficient Inference

We introduce multi-token prediction (MTP) variants of the Apriel model family, designed to generate multiple tokens per forward pass. …

NOW AI, 2025.

Large Language Models achieve their success through transformer architectures with attention mechanisms that compute token …

NOW AI, 2025.

Retrieval-Augmented Generation (RAG) has become ubiquitous when deploying Large Language Models (LLMs), as it can address typical …

Patrice Béchard, Orlando Marquez

Knowledge Discovery and Data Mining, 2025.

Graph databases like Neo4j are gaining popularity for handling complex, interconnected data, over traditional relational databases in …

North American Chapter of the Association for Computational Linguistics (NAACL), 2025.

We take significant steps toward unifying autoregressive and diffusion-based sequence generation by extending the SEDD discrete …

Nima Fathi, Torsten Scholak, Pierre-André Noël

Workshop at the International Conference on Learning Representations (ICLR), 2025.

In-context learning (ICL) approaches typically leverage prompting to condition decoder-only language model generation on reference …

Workshop at the Conference on Neural Information Processing Systems (NeurIPS), 2024.

We present a simple meta quantization approach that quantizes different layers of a large language model (LLM) at different bit levels, …

arXiv, 2024.