ServiceNow AI Research

StarUI: Learning to Ground Agentic Perception in Desktop GUIs

Abstract

Desktop environments remain the blind spot of multimodal-LLM agents: unlike web or mobile, they span heterogeneous software, lack DOM-level APIs, and demand pixel-accurate grounding of UI widgets. We present STARUI, the first large-scale, human-annotated desktop-GUI dataset: 60k screenshots across 87 professional applications with 700k element boxes and labels, plus an instruction-tuning corpus of 475k grounding episodes (basic, spatial, and refusal). Leveraging STARUI, we build UI-Vision-Pro, a 3B/7B vision-language model equipped with a lightweight “ALIGN” grounding head and mixed-granularity GRPO training. On three public suites (OS-World-G, ScreenSpot-Pro, and the new UI-Vision benchmark), STARUI 7B achieves up to +28 pp overall accuracy over the prior best open models, closing two-thirds of the gap to proprietary systems and demonstrating the value of dense desktop supervision. STARUI lays the empirical foundation for enterprise-grade agents that can read, click, and reason across the full spectrum of desktop GUIs.

Publication
NOW AI
Shravan Nayak
Visiting Researcher

Visiting Researcher at Frontier AI Research located at Montreal, QC, Canada.

David Vazquez
Director of AI Research

Director of AI Research at AI Research Management located at Montreal, QC, Canada.

Perouz Taslakian
Research Lead

Research Lead at Frontier AI Research located at Montreal, QC, Canada.

Christopher Pal
Distinguished Scientist

Distinguished Scientist at AI Research Partnerships & Ecosystem located at Montreal, QC, Canada.

Spandana Gella
Research Manager

Research Manager at Frontier AI Research located at Montreal, QC, Canada.