ServiceNow Research

Multi-modal Learning

VCR: Visual Caption Restoration
We introduce Visual Caption Restoration (VCR), a novel vision-language task that challenges models to accurately restore partially …
WorkArena++: Towards Compositional Planning and Reasoning-based Common Knowledge Work Tasks
The ability of large language models (LLMs) to mimic human-like intelligence has led to a surge in LLM-based autonomous agents. Though …
Multimodal foundation world models for generalist embodied agents
Learning generalist embodied agents, able to solve multitudes of tasks in different domains, is a long-standing problem. Reinforcement …
InCoRo: In-Context Learning for Robotics Control with Feedback Loops
One of the challenges in robotics is to endow robotic units with reasoning capabilities robust enough to execute …
StarVector: Generating Scalable Vector Graphics Code from Images and Text
Scalable Vector Graphics (SVGs) have become integral in modern image rendering and graphic design applications due to their infinite …
Are Diffusion Models Vision-And-Language Reasoners?
Text-conditioned image generation models have recently shown immense qualitative success using denoising diffusion processes. However, …
MAPL: Parameter-Efficient Adaptation of Unimodal Pre-Trained Models for Vision-Language Few-Shot Prompting
Large pre-trained models have proved to be remarkable zero- and (prompt-based) few-shot learners in unimodal vision and language tasks. …
FigGen: Text to Scientific Figure Generation
The generative modeling landscape has experienced tremendous growth in recent years, particularly in generating natural images and art. …
Haptics-based Curiosity for Sparse-reward Tasks
Robots in many real-world settings have access to force/torque sensors in their grippers, and tactile sensing is often necessary in tasks …
Adaptive Cross-Modal Few-shot Learning
Metric-based meta-learning techniques have successfully been applied to few-shot classification problems. In this paper, we propose to …