ServiceNow AI Research

StarUI: Learning to Ground Agentic Perception in Desktop GUIs

Abstract

Desktop environments remain the blind spot of multimodal-LLM agents: unlike web or mobile, they span heterogeneous software, lack DOM-level APIs, and demand pixel-accurate grounding of UI widgets. We present STARUI, the first large-scale, human-annotated desktop-GUI dataset: 60k screenshots across 87 professional applications with 700k element boxes and labels, plus an instruction-tuning corpus of 475k grounding episodes (basic, spatial, and refusal). Leveraging STARUI, we build UI-Vision-Pro, a 3B/7B vision-language model equipped with a lightweight “ALIGN” grounding head and mixed-granularity GRPO training. On three public suites (OS-World-G, ScreenSpot-Pro, and the new UI-Vision benchmark), STARUI 7B achieves up to +28 pp overall accuracy over the prior best open models, closing two-thirds of the gap to proprietary systems and demonstrating the value of dense desktop supervision. STARUI lays the empirical foundation for enterprise-grade agents that can read, click, and reason across the full spectrum of desktop GUIs.

Publication
NOW AI
Shravan Nayak
Visiting Researcher

Visiting Researcher at Frontier AI Research located at Montreal, QC, Canada.

David Vazquez
Director of AI Research

Director of AI Research at AI Research Management located at Montreal, QC, Canada.

Perouz Taslakian
Research Lead

Research Lead at Frontier AI Research located at Montreal, QC, Canada.

Christopher Pal
Distinguished Scientist

Distinguished Scientist at AI Research Partnerships & Ecosystem located at Montreal, QC, Canada.

Spandana Gella
Research Manager

Research Manager at Frontier AI Research located at Montreal, QC, Canada.