WebMMU: A Benchmark for Multimodal Multilingual Website Understanding and Code Generation
Understanding diverse web data and automating web development presents an exciting challenge for agentic multimodal models. While …
We present WebMMU, a multilingual benchmark that evaluates three core web tasks: (1) website visual question answering, (2) code …
GEOBench-VLM: Benchmarking Vision-Language Models for Geospatial Tasks
While numerous recent benchmarks focus on evaluating generic Vision-Language Models (VLMs), they fall short in addressing the unique …
Beyond Naïve Prompting: Strategies for Improved Zero-shot Context-aided Forecasting with LLMs
Forecasting in real-world settings requires models to integrate not only historical data but also relevant contextual information, …
BiXSE: Improving Dense Retrieval via Probabilistic Graded Relevance Distillation
Neural sentence embedding models for dense retrieval typically rely on binary relevance labels, treating query-document pairs as …
DoomArena: A Framework for Testing AI Agents Against Evolving Security Threats
We present DoomArena, a security evaluation framework for AI agents. DoomArena is designed around three principles: 1) It is a …
Exploring Sparse Adapters for Scalable Merging of Parameter Efficient Experts
Merging parameter-efficient task experts has recently gained growing attention as a way to build modular architectures that can be …
Unifying Autoregressive and Diffusion-Based Sequence Generation
We present significant extensions to diffusion-based sequence generation models, blurring the line with autoregressive language models. …
Fine-Tune an SLM or Prompt an LLM? The Case of Generating Low-Code Workflows
Large Language Models (LLMs) such as GPT-4o can handle a wide range of complex tasks with the right prompt. As per-token costs are …