ServiceNow AI Research

Agents

GitChameleon: Evaluating AI Code Generation Against Python Library Version Incompatibilities
The rapid evolution of software libraries presents a significant challenge for code generation models, which must adapt to frequent …
No, of Course I Can! Deeper Fine-Tuning Attacks That Bypass Token-Level Safety Mechanisms
Leading language model (LM) providers like OpenAI and Anthropic allow customers to fine-tune frontier LMs for specific use cases. To …
How to Train Your LLM Web Agent: A Statistical Diagnosis
Large language model (LLM) agents for web interfaces have advanced rapidly, yet open-source systems still lag behind proprietary …
AgentLab Controller: Level Up Your Web Agent with Step-Through Debugging
Recent progress in building computer-using agents has enabled large language models to navigate browser environments and solve complex …
FM2DS: Few-Shot Multimodal Multihop Data Synthesis with Knowledge Distillation for Question Answering
We introduce DRBench, a benchmark for evaluating AI agents on complex, open-ended enterprise deep research tasks. Unlike existing …
Hinting Around: Helping Web Agents Solve Tasks via Hints
While web agents offer an avenue to solve a plethora of tasks due to their ability to navigate the web, they are still brittle and …