Centering Knowledge Along the Responsible LLM Supply Chain: An Empirical Study & Multi-Stakeholder Taxonomy
Framing LLMs as products of complex supply chains rather than monolithic entities facilitates the creation of nuanced approaches to …
Privileged Information Distillation for Language Models
Training-time privileged information (PI) can enable language models to succeed on tasks they would otherwise fail, making it a …
GitChameleon: Evaluating AI Code Generation Against Python Library Version Incompatibilities
The rapid evolution of software libraries presents a significant challenge for code generation models, which must adapt to frequent …
Societal Alignment Frameworks Can Improve LLM Alignment
Recent progress in large language models (LLMs) has focused on producing responses that meet human expectations and align with shared …
Augmenting LLM Reasoning with Dynamic Notes Writing for Complex QA
Iterative RAG for multi-hop question answering faces challenges with lengthy contexts and the buildup of irrelevant information. This …
DRBench: A Realistic Benchmark for Enterprise Deep Research
We introduce DRBench, a benchmark for evaluating AI agents on complex, open-ended deep research tasks in enterprise settings. Unlike …
Grounding Computer Use Agents on Human Demonstrations
Building reliable computer-use agents requires grounding: accurately connecting natural language instructions to the correct on-screen …
No, of Course I Can! Deeper Fine-Tuning Attacks That Bypass Token-Level Safety Mechanisms
Leading language model (LM) providers like OpenAI and Anthropic allow customers to fine-tune frontier LMs for specific use cases. To …
Learning a Spatial Partitioning and its Causal Relations from Temporal Data
Scientific research often seeks to understand the causal structure underlying high-level variables in a system. For example, climate …
StarFlow: Generating Structured Workflow Outputs From Sketch Images
Workflows are a fundamental component of automation in enterprise platforms, enabling the orchestration of tasks, data processing, and …