ServiceNow AI Research

MMTEB: Massive Multilingual Text Embedding Benchmark

Text embeddings are typically evaluated on a narrow set of tasks, limited in terms of languages, domains, and task types. To circumvent …

LitLLMs, LLMs for Literature Review: Are We There Yet?

Literature reviews are an essential component of scientific research, but they remain time-intensive and challenging to write, …

StarVector: Generating Scalable Vector Graphics Code from Images and Text

Scalable Vector Graphics (SVGs) are vital for modern image rendering due to their scalability and versatility. Previous SVG generation …

Do LLMs Know When to NOT Answer? Investigating Abstention Abilities of Large Language Models

Abstention Ability (AA) is a critical aspect of Large Language Model (LLM) reliability, referring to an LLM’s capability to …

The BrowserGym Ecosystem for Web Agent Research

The BrowserGym ecosystem addresses the growing need for efficient evaluation and benchmarking of web agents, particularly those …

AgentMerge: Enhancing Generalization in Fine-Tuned LLM Agents

Recent advancements in large language models (LLMs) have spurred interest in developing autonomous agents capable of performing complex …

Fine-Tuning Web Agents: It Works, But It's Trickier Than You Think

Recent advancements in large language models (LLMs) have sparked interest in developing autonomous web agents capable of performing …

Multimodal foundation world models for generalist embodied agents

Learning generalist agents, able to solve a multitude of tasks across different domains, is a long-standing problem. Reinforcement learning …

RepLiQA: A Question-Answering Dataset for Benchmarking LLMs on Unseen Reference Content

Large Language Models (LLMs) are trained on vast amounts of data, most of which is automatically scraped from the internet. This data …