Desktop environments remain the blind spot of multimodal-LLM agents: unlike the web or mobile, they span heterogeneous software, lack DOM-level APIs, and demand pixel-accurate grounding of UI widgets. We present STARUI, the first large-scale, human-annotated desktop-GUI dataset: 60K screenshots across 87 professional applications, annotated with 700K element bounding boxes and labels, plus an instruction-tuning corpus of 475K grounding episodes (basic, spatial, and refusal). Leveraging STARUI, we build UI-Vision-Pro, a 3B/7B vision-language model equipped with a lightweight “ALIGN” grounding head and trained with mixed-granularity GRPO. On three public benchmarks (OS-World-G, ScreenSpot-Pro, and the new UI-Vision suite), the 7B model achieves up to +28 pp higher overall accuracy than the best prior open models, closing two-thirds of the gap to proprietary systems and demonstrating the value of dense desktop supervision. STARUI lays the empirical foundation for enterprise-grade agents that can read, click, and reason across the full spectrum of desktop GUIs.
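To make the dataset description concrete, the sketch below shows one plausible way a grounding episode of each type (basic, spatial, refusal) might be represented in Python. The field names, coordinate convention, and example values are illustrative assumptions, not the released STARUI schema.

```python
# Hypothetical sketch of a STARUI-style grounding episode.
# All field names and values below are assumptions for illustration only.
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class GroundingEpisode:
    screenshot_path: str                               # desktop screenshot the query refers to
    instruction: str                                    # natural-language grounding query
    episode_type: str                                   # "basic" | "spatial" | "refusal"
    target_box: Optional[Tuple[int, int, int, int]]     # (x1, y1, x2, y2) in pixels; None for refusal
    target_label: Optional[str]                         # human-annotated widget label

# One example per episode type named in the abstract (contents are invented):
basic = GroundingEpisode("shots/app_001.png", "Click the brush tool.",
                         "basic", (34, 120, 66, 152), "Paintbrush")
spatial = GroundingEpisode("shots/app_001.png",
                           "Select the icon directly below the eraser.",
                           "spatial", (34, 184, 66, 216), "Clone")
refusal = GroundingEpisode("shots/app_001.png",
                           "Open the 3D sculpting panel.",  # element not present on screen
                           "refusal", None, None)
```

Refusal episodes, in this reading, carry no target box or label: the model is expected to decline rather than ground a non-existent widget.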