April 9, 2026 | 5 min read | AI Research

Measuring what matters for enterprise AI

Joyce Li, Principal Product Manager, AI, ServiceNow
Sridhar Nemala, Sr Dir, Machine Learning Engineering, ServiceNow
Nitin Aggarwal, Product Leader, AI, ServiceNow
Top takeaways:

  • Stop relying on generic benchmarks to make enterprise AI deployment decisions.
  • The biggest gap is planning and judgment, not button-clicking tool execution.
  • "Safe abstention" is still unreliable and should be treated as a production blocker.

When you spend enough time working at the intersection of AI research and enterprise operations, one thing becomes clear: The tools we use to evaluate AI agents weren't built with the enterprise in mind. They were built for a different era, one of isolated, single-step tasks performed against clean, controlled datasets.

That's not what enterprise work looks like, and it's not what enterprise AI needs to handle.

Enterprise AI demands long-horizon planning capabilities across multiple domains and tools. It also requires persistent state management in interconnected systems, policy and compliance adherence, and reliable error recovery.

That difference matters in practice. When organizations make deployment decisions based on benchmark scores that don't reflect their operational environments, they're not making informed decisions; they're making optimistic ones.

The risk is that AI models could appear capable in benchmarks but fail catastrophically in reality. That's what motivated us to build NOWAI-Bench, with EnterpriseOps-Gym as its first component.

Here’s some important context regarding the findings below: NOWAI-Bench evaluates general-purpose AI models without any platform support—raw capability, without orchestration, guardrails, or workflow intelligence. The results reflect the floor of what's possible, not the ceiling. They’re not a reflection of what a purpose-built enterprise AI platform, including ServiceNow’s offering, can deliver.

A purpose-built solution

We believe NOWAI-Bench: EnterpriseOps-Gym is the most comprehensive benchmark to date across enterprise business workflows spanning IT service management (ITSM), customer service management (CSM), and HR. It’s designed specifically to evaluate AI agents against the complexity that exists in production environments.

What makes enterprise workflows genuinely challenging is that they're interconnected. A single IT service request doesn't live in isolation. It may trigger a chain of actions across HR operations, asset management, and customer service.

Data is distributed across dozens—sometimes hundreds—of interdependent tables. Business logic is layered, sequential, and deeply context dependent. Most benchmarks sidestep this entirely. NOWAI-Bench was built to confront it directly.

The benchmark evaluates AI agents across 1,150 real-world enterprise tasks, incorporating 512 functional tools and 164 interconnected database tables. It creates an evaluation environment that reflects how enterprise platforms operate with cross-departmental dependencies, multistep task execution, and contextual reasoning requirements that can't be reduced to a single prompt and response.

The goal was to give enterprises something that’s been conspicuously missing: a reliable, standardized way to assess whether an AI agent is ready for its environment before it's deployed. The tool is grounded not in controlled demonstrations, but in measurable performance against realistic operational complexity.

Key findings

We evaluated a range of AI models (leading proprietary and open-source options alike) across the full EnterpriseOps-Gym benchmark. What we found confirmed what we had suspected from working in this space: Enterprise workflows are significantly more challenging than existing evaluations suggest, and they expose failure modes that general-purpose benchmarks simply don't reveal.

All findings below reflect raw model performance without platform orchestration or guardrails.

Strategic reasoning, not tool use, is the dominant bottleneck

When agents were provided with expert-generated task plans, performance improved by 15% to 35% on the most complicated enterprise domains. Notably, tool selection and execution remained stable even under adversarial conditions.

This tells us that models don't fail at the point of action. They fail at knowing what to do across a constrained, multistep workflow when no one has mapped it out for them. The bottleneck is planning, not execution.

Safe abstention remains an unsolved problem

We designed 30 tasks that a well-calibrated enterprise AI agent should simply refuse—requests involving policy violations, missing permissions, or unavailable resources. The highest-performing model correctly identified these as infeasible only about half the time. The failures weren't benign: They frequently resulted in unintended system changes.

In enterprise environments, where policy compliance and data integrity are nonnegotiable, an AI agent that can't reliably say no isn't production ready. This is precisely why purpose-built enterprise AI platforms invest heavily in guardrails and safety layers.

Failure patterns map directly to real business complexity

The ways agents broke down across ITSM, CSM, and HR workflows weren't random; they were structurally predictable. We saw referential integrity violations in HR, service-level agreement mismanagement in ITSM, and entitlement verification failures in CSM.

These failures trace directly to the business rules that enterprise orchestration platforms are explicitly built to enforce. They're not gaps in model intelligence alone; they're gaps that platform-level context and governance exist to close.

ServiceNow's platform advantage

NOWAI-Bench measures raw model capability, what AI agents can do before any platform support is applied. ServiceNow AI Agents, including Now Assist, combine model intelligence with the workflow orchestration of the ServiceNow AI Platform. This entails domain-specific guardrails built from billions of enterprise workflow executions, as well as human-in-the-loop escalations for high-stakes decisions.

In other words, the benchmark quantifies the gap that platforms are built to close. The finding that expert planning improves agent performance by up to 35% directly validates why that platform layer matters.

ServiceNow’s semantic layer, which powers the company’s own generative AI tools, spans workflow intelligence, knowledge graphs, asset graphs, and access controls. It provides exactly the structured context that raw models lack and is the prerequisite for making enterprise AI reliable at scale.

NOWAI-Bench is also designed to benefit the broader AI ecosystem. By establishing a rigorous, enterprise-grounded evaluation standard, it gives model developers a concrete research roadmap and gives enterprise buyers an objective basis for evaluating AI agent capabilities. ServiceNow is using these same insights to guide its own approach to model evaluation and deployment.

Looking ahead

As part of ServiceNow AI Research's broader enterprise AI evaluation initiative, NOWAI-Bench will scale from hundreds of workflow scenarios to thousands of compositional, cross-domain tasks. Multi-agent coordination and voice/multimodal evaluation, including speech-to-action workflows and document understanding, are also on the roadmap as enterprise AI moves well beyond text in, text out.

Core benchmark tasks and the evaluation environment will be released publicly to support reproducibility and community contribution, and enterprise-grounded datasets drawn from real-world workflow patterns will be available through controlled research partnerships. The goal is straightforward: to give enterprises a trusted, transparent way to evaluate AI agents against the operational realities that actually matter.

Acknowledgments

NOWAI-Bench is the result of deep collaboration across ServiceNow. We want to thank the ServiceNow applied AI team (Shiva Malay, Shravan Nayak, Jishnu Nair, Sagar Davasam, Aman Tiwari, Sridhar Nemala, Srinivas Sunkara, and Sai Rajeswar) for the scientific rigor, benchmark design, and evaluation infrastructure behind this work.

Equal credit goes to the AI Foundations product and engineering teams (Ravi Krishnamurthy, Ganpathy Krishnan, Joyce Li, Raahul Srinivasan, and Nitin Aggarwal), whose real-world perspective on enterprise workflows, model lifecycle management, and responsible AI deployment helped ground this benchmark in operational realities that matter most to our customers.

This is just the beginning. We look forward to working with the research community to push enterprise AI evaluation forward together.

Find out more about ServiceNow AI Research.

References

  1. Paper: https://arxiv.org/abs/2603.13594 
  2. Website: https://enterpriseops-gym.github.io/ 
  3. Dataset: https://huggingface.co/datasets/ServiceNow-AI/EnterpriseOps-Gym 
  4. Code: https://github.com/ServiceNow/NOWAI-Bench, https://github.com/ServiceNow/EnterpriseOps-Gym, https://github.com/ServiceNow/eva

LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders

Nicolas Chapados and Siva Reddy also contributed to this content.

Text-embedding models convert a piece of text, such as a search query, document, or piece of code, into a sequence of real-valued numbers. Given such embeddings, we can measure the similarity, or relatedness, of pieces of text. This facilitates various important applications, such as search, clustering, retrieval, and classification.
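To make this concrete, here is a minimal sketch in PyTorch of how embeddings turn relatedness into a simple vector computation; the random vectors stand in for real model outputs, and 768 is just an illustrative embedding dimension.

import torch
import torch.nn.functional as F

# Toy embeddings for three pieces of text; in practice these come from a model.
embeddings = torch.randn(3, 768)

# Normalize so the dot product equals cosine similarity, in [-1, 1].
embeddings = F.normalize(embeddings, p=2, dim=1)
similarity = embeddings @ embeddings.T  # similarity[i, j] relates text i and text j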

With the widespread availability of decoder-only large language models (LLMs), such as GPT-4, LLaMA2, Mistral-7B, and StarCoder2, a pressing question in the natural language processing (NLP) research community is how best to use these models to construct powerful text embeddings.

We’re excited to present LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders, a simple and efficient approach that transforms any decoder-only LLM into a powerful text encoder in an unsupervised fashion, using only adapters (LoRA) and without modifying the base models.

Below we give an overview of the key components of LLM2Vec and present the exciting results we got when benchmarking LLM2Vec models on the challenging Massive Text Embeddings Benchmark (MTEB).

Our LLM2Vec-Mistral ranks first on the MTEB leaderboard in the unsupervised category, first in the supervised category among the models trained on publicly available embedding data (E5), and seventh on the overall leaderboard (the other top six models are trained on synthetic data generated from GPT-4/similar-scale models).

[Figure: LLM2Vec overview (Mila, McGill, ServiceNow), showing the three steps of enabling bidirectional attention, masked next-token prediction, and unsupervised contrastive learning]

A simple and efficient recipe

At its core, LLM2Vec consists of three simple steps (sketched in code after this list):

  1. Enabling bidirectional attention
  2. Adaptation via masked next-token prediction (MNTP)
  3. Adaptation via unsupervised contrastive learning
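As a rough illustration of these three steps, the sketch below uses plain PyTorch with toy tensors. The loss function is our own minimal rendering of a SimCSE-style contrastive objective, not the exact training code, and the temperature value is an assumption.

import torch
import torch.nn.functional as F

# Step 1 -- bidirectional attention: drop the causal mask so every token can
# attend to the full sequence, as in encoder models like BERT.
scores = torch.randn(4, 4)  # toy attention scores for a 4-token sequence
mask = torch.tril(torch.ones(4, 4, dtype=torch.bool))
causal_attn = scores.masked_fill(~mask, float("-inf")).softmax(dim=-1)  # decoder default
bidirectional_attn = scores.softmax(dim=-1)                             # LLM2Vec: no mask

# Step 2 -- masked next-token prediction (MNTP): mask a token and train the model
# to predict it from the position just before it, keeping the objective aligned
# with the model's next-token pretraining.

# Step 3 -- unsupervised contrastive learning (SimCSE-style): encode the same
# sentence twice with different dropout masks, pull the two embeddings together,
# and push apart the embeddings of other sentences in the batch.
def simcse_loss(z1, z2, temperature=0.05):
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.T / temperature   # (batch, batch) similarity matrix
    labels = torch.arange(z1.size(0))  # positives sit on the diagonal
    return F.cross_entropy(logits, labels)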

Adapting a model with the LLM2Vec approach is highly efficient and works with parameter-efficient fine-tuning methods such as LoRA. Additionally, the adaptation can be performed using a general domain corpus such as Wikipedia, requires only a few hundred training steps, and can be run on a single GPU.
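For instance, attaching LoRA adapters with the Hugging Face peft library looks roughly like this; the checkpoint name and hyperparameters are illustrative, not the exact values used in the paper.

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Any decoder-only LLM works; Mistral-7B is one of the models used in this work.
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

# Only the small low-rank adapter matrices are trained; base weights stay frozen.
lora_config = LoraConfig(
    r=16,  # adapter rank (illustrative)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    bias="none",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # a tiny fraction of the 7B parameters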

State-of-the-art performance

LLM2Vec is not only simple and efficient, but it also leads to state-of-the-art performance on the challenging MTEB, both in the unsupervised and supervised setting (among models trained only on publicly available data).

Unsupervised results

We applied LLM2Vec to some of the best-performing LLMs available and evaluated the resulting text-embedding models on MTEB. In the unsupervised setting (i.e., without using any labeled training data for contrastive learning), our LLM2Vec-transformed models achieved a new state-of-the-art performance of 56.80, outperforming the previous unsupervised approach by a large margin.

[Table: unsupervised MTEB results for LLM2Vec applied to S-LLaMA-1.3B, LLaMA-2-7B, and Mistral-7B, compared with encoder-only baselines]
[Table: supervised MTEB results for the same three models, compared with previous work trained on public data only]

Supervised results

LLM2Vec can also be easily combined with supervised contrastive learning. As our results show, applying LLM2Vec before supervised contrastive learning leads to a substantial improvement.

Moreover, LLM2Vec in combination with Mistral-7B, currently the best-performing 7 billion-parameter LLM, leads to a new state-of-the-art performance of 64.80 on MTEB among models trained only with publicly available data.

Highly sample-efficient

LLM2Vec-transformed models require less training data to perform well compared to models trained without the LLM2Vec transformation.

These results make us particularly excited about challenging real-world scenarios where large amounts of labeled data might be costly to acquire.

Use it on your own data

We’ve made it easy for you to use our LLM2Vec-transformed models. The LLM2Vec class is a wrapper on top of Hugging Face models that supports sequence encoding and pooling operations. The steps below show an example of how to use the library.

[Figure: amount of training data needed for Sheared-LLaMA-1.3B, Llama-2-7b-chat-hf, and Mistral-7B-Instruct-v0.2, with and without LLM2Vec]

Preparing the model

Here, we first initialize the model and apply MNTP-trained LoRA weights on top (a sketch follows the list below). After merging the model with the MNTP weights, we can either:

  • Load the unsupervised-trained LoRA weights (trained with SimCSE objective and wiki corpus)
  • Load the model with supervised-trained LoRA weights (trained with contrastive learning and public E5 data)
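A minimal sketch of that flow, using transformers and peft; the checkpoint names below are illustrative, so check the project page for the released weights.

import torch
from transformers import AutoTokenizer, AutoModel, AutoConfig
from peft import PeftModel

# Illustrative checkpoint name; see the project page for the released weights.
base = "McGill-NLP/LLM2Vec-Mistral-7B-Instruct-v2-mntp"

tokenizer = AutoTokenizer.from_pretrained(base)
config = AutoConfig.from_pretrained(base, trust_remote_code=True)
model = AutoModel.from_pretrained(base, trust_remote_code=True, config=config,
                                  torch_dtype=torch.bfloat16)

# Apply the MNTP-trained LoRA weights and merge them into the base model.
model = PeftModel.from_pretrained(model, base)
model = model.merge_and_unload()

# Then load either the unsupervised (SimCSE) or the supervised LoRA weights on top.
model = PeftModel.from_pretrained(model, base + "-unsup-simcse")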

Applying LLM2Vec wrapper

Then, we define our LLM2Vec encoder model as follows:

from llm2vec import LLM2Vec

# Wrap the prepared model; mean pooling averages token states into one embedding per input.
l2v = LLM2Vec(model, tokenizer, pooling_mode="mean", max_length=512)

Inference

This model now returns the text embedding for any input in the form of [[instruction1, text1], [instruction2, text2]] or [text1, text2]. During training, instructions are provided for both sentences in symmetric tasks and only for the queries in asymmetric tasks.
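For example, a retrieval-style (asymmetric) call might look like the following; the instruction string and texts are illustrative, and we assume the library's encode method, which returns one embedding per input.

import torch.nn.functional as F

# Asymmetric task: the instruction is attached to queries only.
instruction = "Given a web search query, retrieve relevant passages that answer the query:"
queries = [[instruction, "how much protein should a female eat"]]
documents = ["As a general guideline, the recommended dietary allowance for women is 46 grams of protein per day."]

q_reps = l2v.encode(queries)    # one embedding per query
d_reps = l2v.encode(documents)  # plain strings for documents

# Cosine similarity between query and document embeddings.
similarity = F.normalize(q_reps, dim=1) @ F.normalize(d_reps, dim=1).T
print(similarity)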

Summary

As demonstrated above, LLM2Vec is a simple unsupervised approach that can transform any pretrained decoder-only LLM into a strong text encoder.

If you’re as excited about LLM2Vec as we are, check out our hands-on tutorial, which walks you through the different steps of our method. We also welcome contributions on GitHub and invite the community to share their LLM2Vec-transformed models.

Research: Project page

Code: LLM2Vec on GitHub

Tutorial: Learn how to apply LLM2Vec to LLaMA-2

Find out more about ServiceNow AI Research.
