The Promise of RL for Autoregressive Image Editing
While image generation techniques are now capable of producing high-quality images that respect prompts spanning multiple sentences, …
AgentLab Controller: Level Up Your Web Agent with Step-Through Debugging
Recent progress in building computer-using agents has enabled large language models to navigate browser environments and solve complex …
AlignVLM: Bridging Vision and Language Latent Spaces for Multimodal Document Understanding
Aligning visual features with language embeddings is a key challenge in vision-language models (VLMs). The performance of such models …
Apriel-MTP: Multi-Token Prediction for Faster and More Efficient Language
We introduce multi-token prediction (MTP) variants of the Apriel model family, designed to generate multiple tokens per forward pass. …
Attack What Matters: Integrating Expert Insight and Automation in Threat-Model-Aligned Red Teaming
Prompt injection attacks target a key vulnerability in modern large language models: their inability to reliably distinguish between …
BigCharts-R1: Enhanced Chart Reasoning With Visual Reinforcement Finetuning
Chart understanding is critical for ServiceNow for data analysis and for reasoning over visualizations, such as interpreting trends, identifying …
ColMate: Contrastive Late Interaction and Masked Text for Multimodal Document Retrieval
Retrieval-augmented generation has proven practical when models require specialized knowledge or access to the latest data. However, …
DualChronos: Context-Aided Time Series Forecasting with Dual Modalities
The dynamics of complex systems often depend heavily on external context, and natural language is an intuitive medium for practitioners …