ServiceNow Research
Multi-modal Learning
BigCharts-R1: Enhanced Chart Reasoning with Visual Reinforcement Finetuning
Charts are essential to data analysis, transforming raw data into clear visual representations that support human decision-making. …
Ahmed Masry, Abhay Puri, Masoud Hashemi, Juan A. Rodriguez, Megh Thakkar, Khyati Mahajan, Vikas Yadav, Sathwik Tejaswi Madhusudhan, Alexandre Piche, Dzmitry Bahdanau, Christopher Pal, David Vazquez, Enamul Hoque Prince, Perouz Taslakian, Sai Rajeswar Mudumba, Spandana Gella
Conference on Language Modeling (COLM), 2025.
UI-Vision: A Desktop-centric GUI Benchmark for Visual Perception and Interaction
Developing autonomous agents that can navigate diverse Graphical User Interfaces (GUIs) and solve complex tasks is essential for …
Shravan Nayak, Xiangru Jian, Kevin Lin, Juan A. Rodriguez, Motek Kalsi, Nicolas Chapados, Tamer Özsu, Aishwarya Agrawal, David Vazquez, Christopher Pal, Perouz Taslakian, Spandana Gella, Sai Rajeswar Mudumba
International Conference on Machine Learning (ICML), 2025.
StarVector: Generating Scalable Vector Graphics Code from Images and Text
Scalable Vector Graphics (SVGs) are vital for modern image rendering due to their scalability and versatility. Previous SVG generation …
Juan A. Rodriguez, Abhay Puri, Shubham Agarwal, Issam H. Laradji, Pau Rodriguez, Sai Rajeswar Mudumba, David Vazquez, Christopher Pal, Marco Pedersoli
Computer Vision and Pattern Recognition (CVPR), 2025.
AlignVLM: Bridging Vision and Language Latent Spaces for Multimodal Understanding
Aligning visual features with language embeddings is a key challenge in vision-language models (VLMs). The performance of such models …
Ahmed Masry, Juan A. Rodriguez, Tianyu Zhang, Suyuchen Wang, Chao Wang, Aarash Feizi, Akshay Kalkunte, Abhay Puri, Xiangru Jian, Pierre-André Noël, Sathwik Madhusudhan, Marco Pedersoli, Bang Liu, Nicolas Chapados, Yoshua Bengio, Enamul Hoque Prince, Christopher Pal, Issam H. Laradji, David Vazquez, Perouz Taslakian, Spandana Gella, Sai Rajeswar Mudumba
Workshop at the International Conference on Learning Representations (ICLR), 2025.
BigDocs: An Open and Permissively-Licensed Dataset for Training Multimodal Models on Document and Code Tasks
Multimodal AI has the potential to significantly enhance document-understanding tasks, such as processing receipts, understanding …
Juan A. Rodriguez, Xiangru Jian, Siba Smarak Panigrahi, Tianyu Zhang, Aarash Feizi, Abhay Puri, Akshay Kalkunte, Francois Savard, Ahmed Masry, Shravan Nayak, Rabiul Awal, Mahsa Massoud, Amirhossein Abaskohi, Zichao Li, Suyuchen Wang, Pierre-André Noël, Mats L. Richter, Saverio Vadacchino, Shubham Agarwal, Sanket Biswas, Sara Shanian, Ying Zhang, Sathwik Tejaswi Madhusudhan, João Monteiro, Krishnamurthy (Dj) Dvijotham, Torsten Scholak, Nicolas Chapados, Sepideh Kharaghani, Sean Hughes, Tamer Özsu, Siva Reddy, Marco Pedersoli, Yoshua Bengio, Christopher Pal, Issam H. Laradji, Spandana Gella, Perouz Taslakian, David Vazquez, Sai Rajeswar Mudumba
International Conference on Learning Representations (ICLR), 2025.
VCR: Visual Caption Restoration
We introduce Visual Caption Restoration (VCR), a novel vision-language task that challenges models to accurately restore partially …
Tianyu Zhang, Suyuchen Wang, Lu Li, Ge Zhang, Perouz Taslakian, Sai Rajeswar Mudumba, Jie Fu, Bang Liu, Yoshua Bengio
International Conference on Learning Representations (ICLR), 2025.
StarVector: Generating Scalable Vector Graphics Code from Images and Text
Scalable Vector Graphics (SVGs) are vital for modern image rendering due to their scalability and versatility. Previous SVG generation …
Juan A. Rodriguez, Abhay Puri, Shubham Agarwal, Issam H. Laradji, Pau Rodriguez, Sai Rajeswar Mudumba, David Vazquez, Christopher Pal, Marco Pedersoli
AAAI Demos, 2025.
BigDocs: A Permissively-Licensed Dataset for Training Vision-Language Models on Document and Code Tasks
Vision and language models that can accurately understand both images and text are crucial for deeper document understanding. These …
Juan A. Rodriguez, Xiangru Jian, Siba Smarak Panigrahi, Tianyu Zhang, Aarash Feizi, Abhay Puri, Akshay Kalkunte, Francois Savard, Amirhossein Abaskohi, Ahmed Masry, Shravan Nayak, Mahsa Massoud, Rabiul Awal, Pierre-André Noël, Mats L. Richter, Saverio Vadacchino, Shubham Agarwal, Sanket Biswas, Ying Zhang, Sathwik Tejaswi Madhusudhan, João Monteiro, Krishnamurthy (Dj) Dvijotham, Torsten Scholak, Nicolas Chapados, Sean Hughes, Tamer Özsu, Aishwarya Agrawal, Marco Pedersoli, Christopher Pal, Perouz Taslakian, David Vazquez, Issam H. Laradji, Spandana Gella, Sai Rajeswar Mudumba
Workshop at the Conference on Neural Information Processing Systems (NeurIPS), 2024.
Multimodal foundation world models for generalist embodied agents
Learning generalist agents, able to solve multitudes of tasks in different domains, is a long-standing problem. Reinforcement learning …
Pietro Mazzaglia, Tim Verbelen, Bart Dhoedt, Aaron Courville, Sai Rajeswar Mudumba
Neural Information Processing Systems (NeurIPS), 2024.
Representing Positional Information in Generative World Models for Object Manipulation
The ability to predict outcomes of interactions between embodied agents and objects is paramount in the robotic setting. While …
Stefano Ferraro, Pietro Mazzaglia, Tim Verbelen, Sai Rajeswar Mudumba
Workshop at the Conference on Neural Information Processing Systems (NeurIPS), 2024.