ServiceNow AI Research
Multi-modal Learning
Grounding Computer Use Agents on Human Demonstrations
Building reliable computer-use agents requires grounding: accurately connecting natural language instructions to the correct on-screen …
Aarash Feizi, Shravan Nayak, Xiangru Jian, Kevin Qinghong Lin, Kaixin Li, Rabiul Awal, Xing Han Lu, Johan Obando, Juan A. Rodriguez, Nicolas Chapados, David Vazquez, Adriana Romero Soriano, Reihaneh Rabbany, Perouz Taslakian, Christopher Pal, Spandana Gella, Sai Rajeswar Mudumba
International Conference on Learning Representations, 2026.
StarFlow: Generating Structured Workflow Outputs From Sketch Images
Workflows are a fundamental component of automation in enterprise platforms, enabling the orchestration of tasks, data processing, and …
Patrice Béchard, Chao Wang, Juan A. Rodriguez, Amirhossein Abaskohi, Christopher Pal, David Vazquez, Spandana Gella, Sai Rajeswar Mudumba, Perouz Taslakian
European Chapter of the Association for Computational Linguistics (EACL), 2026.
AlignVLM: Bridging Vision and Language Latent Spaces for Multimodal Document Understanding
Aligning visual features with language embeddings is a key challenge in vision-language models (VLMs). The performance of such models …
Ahmed Masry, Juan A. Rodriguez, Tianyu Zhang, Suyuchen Wang, Chao Wang, Aarash Feizi, Akshay Kalkunte, Abhay Puri, Xiangru Jian, Pierre-André Noël, Sathwik Madhusudhan, Marco Pedersoli, Bang Liu, Nicolas Chapados, Yoshua Bengio, Enamul Hoque Prince, Christopher Pal, Issam H. Laradji, David Vazquez, Perouz Taslakian, Spandana Gella, Sai Rajeswar Mudumba
Neural Information Processing Systems (NeurIPS), 2025.
Rendering-Aware Reinforcement Learning for Vector Graphics Generation
Scalable Vector Graphics (SVG) offer a powerful format for representing visual designs as interpretable code. Recent advances in …
Juan A. Rodriguez, Haotian Zhang, Abhay Puri, Rishav Pramanik, Aarash Feizi, Pascal Wichmann, Arnab Mondal, Mohammad Reza Samsami, Rabiul Awal, Perouz Taslakian, Spandana Gella, Sai Rajeswar Mudumba, David Vazquez, Christopher Pal, Marco Pedersoli
Neural Information Processing Systems (NeurIPS), 2025.
The Promise of RL for Autoregressive Image Editing
While image generation techniques are now capable of producing high quality images that respect prompts which span multiple sentences, …
Saba Ahmadi, Rabiul Awal, Ankur Sikarwar, Amirhossein Kazemnejad, Ge Ya Luo, Juan A. Rodriguez, Sai Rajeswar Mudumba, Siva Reddy, Christopher Pal, Benno Krojer, Aishwarya Agrawal
Neural Information Processing Systems (NeurIPS), 2025.
AlignVLM: Bridging Vision and Language Latent Spaces for Multimodal Document Understanding
Aligning visual features with language embeddings is a key challenge in vision-language models (VLMs). The performance of such models …
Perouz Taslakian, Sai Rajeswar Mudumba, Spandana Gella, Ahmed Masry, Tianyu Zhang, Juan A. Rodriguez, Chao Wang, Abhay Puri, Xiangru Jian, Pierre-André Noël, Issam H. Laradji
NOW AI, 2025.
BigCharts-R1: Enhanced Chart Reasoning With Visual Reinforcement Finetuning
Chart understanding is critical for ServiceNow for data analysis and for reasoning over visualizations, such as interpreting trends, identifying …
Sai Rajeswar Mudumba, Perouz Taslakian, Ahmed Masry, David Vazquez, Christopher Pal, Abhay Puri, Megh Thakkar, Masoud Hashemi, Khyati Mahajan, Spandana Gella
NOW AI, 2025.
ColMate: Contrastive Late Interaction and Masked Text for Multimodal Document Retrieval
Retrieval-augmented generation has proven practical when models require specialized knowledge or access to the latest data. However, …
Ahmed Masry, Megh Thakkar, Patrice Béchard, Sathwik Madhusudhan, Rabiul Awal, Shambhavi Mishra, Akshay Kalkunte, Enamul Hoque Prince, Spandana Gella, Torsten Scholak, Sai Rajeswar Mudumba
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2025.
StarUI: Learning to Ground Agentic Perception in Desktop GUIs
Desktop environments remain the blind spot of multimodal-LLM agents: unlike web or mobile, they span heterogeneous software, lack …
Aarash Feizi, Shravan Nayak, Kevin Qinghong Lin, Kaixin Li, Rabiul Awal, Xiangru Jian, Juan A. Rodriguez, Nicolas Chapados, David Vazquez, Reihaneh Rabbany, Adriana Romero Soriano, Perouz Taslakian, Christopher Pal, Spandana Gella, Sai Rajeswar Mudumba
NOW AI, 2025.
StarVLM ReRank: Better UI Grounding via Enhanced Visual Input and Element Position Perception
UI grounding is a fundamental task for enterprise workflow automation. This task maps natural language instructions to precise pixel …
Suyuchen Wang, Tianyu Zhang, Ahmed Masry, Christopher Pal, Bang Liu, Perouz Taslakian, Spandana Gella
NOW AI, 2025.