Antone King
ServiceNow Employee

The Queries You’ll Never See

 

How do you know your Virtual Agent or AI Search is working?

The answer is usually some version of: test a handful of queries, demo it to stakeholders, move to UAT, and assume our tests are a true representation of what our users will ask…

Do you see the gap this leaves us with? We're making changes to systems that are difficult to evaluate systematically.

Every synonym list update, every knowledge article revision, every change to catalog metadata has downstream effects on what employees find (or don't find) when they search. Without a disciplined way to measure those effects, we're navigating a complex system by intuition alone.

I want to make a case for building a test collection and treating evaluation as an ongoing practice, not a one-time gate.

 

The Query Is Not the Need

A foundational concept worth internalizing in information retrieval: when a user types something into a search bar, they're not expressing what they want; they're expressing what they think will retrieve what they want. These are often different things.

A user searching "VPN not working" might need connection troubleshooting steps. Or they might need to know who to contact. Or they might be trying to file a ticket. Or they might want to know if there's a known outage. The query is an artifact of an information need, not the need itself.

This matters because it shapes how we evaluate success. User intent is extremely hard to infer, yet it is exactly what we need to satisfy. And users, being human, often obscure their intent: not deliberately, but because articulating an information need precisely is genuinely hard.

Good search experiences are empathetic to this gap, and good evaluation practices help account for it.

 

Why Enterprise Search Is Different

If you come from web search thinking, enterprise search will surprise you.

On the web, users expect precision: "give me exactly the right result, and I'll click nothing else." The cost of a bad result is low; they'll just try Google again, or maybe go to another search engine entirely.

In the enterprise, users are captive. This is the system. There's likely no alternative. That changes the psychology in a couple of ways:

Users will work harder: they'll scan a full page of results if they trust the system might have what they need.

-OR-

Users will give up quietly: if they've learned not to trust the search results, they'll route around search entirely. They'll Slack/Teams a colleague, file a ticket they didn't need to file, or just stay stuck.

The second failure mode is mostly invisible in your metrics, and you can't wait for complaints to surface problems, because by the time users complain, they've already developed workarounds.

What we are pursuing is user utility. For enterprise search specifically, that means employee productivity: how much time do people spend looking for the information they need? Precision and recall are the standard proxies for this.

Precision asks: of the results we showed, how many were actually relevant? Recall asks: of all the relevant content that exists, how much did we surface? These two measures trade off against each other. You can always get perfect recall by returning everything, but your precision collapses. You can get high precision by returning only the single most confident result, but you'll end up missing things.

 

Here's where enterprise and web search diverge sharply. On the web, users want high precision at shallow depth: give me the right answer in the first three results; I'm not scrolling. In the enterprise, users often need higher recall. They're not just looking for an answer; they're trying to make sure they haven't missed something: a policy, a prior decision, a related ticket. They'll tolerate scanning a longer list if they trust the system. But "trust the system" is doing a lot of work in that sentence. Trust is earned through consistent relevance, and you can't manage what you don't measure (yes, you have heard that before).
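To make those two measures concrete, here is a minimal sketch in plain Python. The queries and document IDs are hypothetical; the point is only how precision at a cutoff and recall are computed against a set of judged-relevant documents.

```python
def precision_at_k(retrieved, relevant, k):
    """Of the top-k results we showed, how many were judged relevant?"""
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    return sum(1 for doc in top_k if doc in relevant) / len(top_k)


def recall(retrieved, relevant):
    """Of all the relevant documents that exist, how many did we surface?"""
    if not relevant:
        return 0.0
    return sum(1 for doc in retrieved if doc in relevant) / len(relevant)


# Hypothetical judgments for one query: three articles are relevant,
# and the engine returned four results.
retrieved = ["KB001", "KB073", "KB045", "KB112"]
relevant = {"KB001", "KB045", "KB200"}

print(precision_at_k(retrieved, relevant, 3))  # ~0.67: 2 of the top 3 are relevant
print(recall(retrieved, relevant))             # ~0.67: 2 of the 3 relevant docs surfaced
```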

 

Building a Test Collection

A test collection is simply a set of queries paired with judgments about which results should be returned. Relevance is a spectrum, but to start, treat your judgments as binary (relevant or not relevant).

It's the foundation for measuring whether your search is improving, degrading, or holding steady as you make changes.

Start with real queries. Not hypotheticals, not what you think users should be searching for, but actual searches from your environment. Fifty is a reasonable minimum to start seeing patterns. A hundred is better, because you'll want to split them.

Why split? If you tune your system against the same queries you use to evaluate it, you'll end up overfitting. You'll optimize for those specific cases while potentially degrading others. The discipline is:

Development set (50 queries): Use these for tuning. Test your changes here, iterate, adjust.

Held-out test set (50 queries): Use them only for final validation before promoting changes.

By splitting, you're acknowledging that your tuning might not generalize; the held-out set becomes a sanity check on your changes. Exiting UAT then means reaching an agreed threshold across those 100 queries.
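As a sketch of what that structure looks like (illustrative Python; the queries and article numbers are hypothetical placeholders), a test collection is just query/judgment pairs, shuffled once and split down the middle:

```python
import random

# Each real query is paired with the document(s) judged relevant for it.
# Queries and KB numbers here are hypothetical placeholders.
test_collection = [
    {"query": "vpn not working",         "relevant": {"KB0010042"}},
    {"query": "reset mfa token",         "relevant": {"KB0010187", "KB0010188"}},
    {"query": "expense report deadline", "relevant": {"KB0010301"}},
    # ... up to ~100 labeled queries
]

random.seed(42)  # fixed seed so the split is reproducible
random.shuffle(test_collection)

midpoint = len(test_collection) // 2
dev_set = test_collection[:midpoint]    # tune against these
test_set = test_collection[midpoint:]   # touch only for final validation
```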

This approach dates to the 1960s and is called the Cranfield paradigm. It's how every serious search team operates, from ServiceNow's internal teams to Google to academic research labs. The idea isn't complicated: you can't improve what you can't measure, and you can't measure without agreed-upon relevance judgments. Most enterprise teams skip this entirely, not because it's complicated, but because it feels like overhead.

 

Where to find real queries in ServiceNow:

Two tables will get you started if you have nothing. The sys_search_signal_result_event table captures search queries alongside the documents users clicked. You can add the rank column, group by the description column, and sort descending. The sn_ais_assist_qna_log table captures Now Assist Q&A interactions: the user's query, the synthesized response, and the sources used. Neither is perfect. Click data tells you what users settled for. QnA logs tell you what summary they received. But neither tells you what they actually needed. Still, if you have nothing, this is your starting point.

Pull queries with meaningful click activity by grouping and sorting on the document field. For each one, ask: does this click represent a genuinely relevant result for that query? If yes, you have a labeled sample. If no, you've found a gap worth investigating. Either way, you have a real query tied to a real information need.
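If you'd rather pull this data out for offline labeling, here is a sketch using the standard Table API. The instance URL and credentials are placeholders, and the field names (query text, clicked document, rank) are assumptions; check the sys_search_signal_result_event dictionary in your instance before relying on them.

```python
from collections import Counter
import requests

# Placeholder: point at your own instance and use a read-only account.
INSTANCE = "https://your-instance.service-now.com"

resp = requests.get(
    f"{INSTANCE}/api/now/table/sys_search_signal_result_event",
    params={
        # Field names are assumptions; verify them in your instance's dictionary.
        "sysparm_fields": "description,document,rank,sys_created_on",
        "sysparm_query": "ORDERBYDESCsys_created_on",
        "sysparm_limit": "1000",
    },
    auth=("analyst.user", "********"),
    headers={"Accept": "application/json"},
)
resp.raise_for_status()

# Group click events by query text so the most frequent queries surface first.
events = resp.json()["result"]
by_query = Counter(e["description"] for e in events if e.get("description"))
for query, clicks in by_query.most_common(20):
    print(f"{clicks:4d}  {query}")
```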

A caveat on click data: a click on result five might mean "this was the best option." It might also mean "I gave up on the first four and grabbed something close enough." Users satisfice: they pick the first result that seems adequate, not the optimal one. This is why click-through rate is not a great metric in isolation.

 

Diagnosing Where Things Break

Once you have a test collection, you can start evaluating systematically. Run your queries, review the results, and for each failure, ask where it broke down and assign it a failure category.

These tend to fall into three groups:

Retrieval failure: The right document exists in your knowledge base, but the system didn't surface it.

Synthesis failure: The right document was retrieved, but the generated answer was wrong, incomplete, or misleading.

Coverage gap: The right document simply doesn't exist. No amount of tuning will fix missing knowledge.

Labeling failures this way makes them actionable. Start with these three; you can add more (reranker failures, latency issues) as your diagnostic practice matures.
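As a minimal sketch of how those labels become actionable (the queries and labels below are hypothetical review notes), a simple tally is often enough to show where to spend effort next:

```python
from collections import Counter

# Hypothetical review notes: query -> failure category ("ok" when nothing broke).
review = {
    "vpn not working":       "ok",
    "parental leave policy": "retrieval",  # doc exists, never surfaced
    "reset mfa token":       "synthesis",  # right doc retrieved, wrong summary
    "relocation allowance":  "coverage",   # no article exists at all
}

counts = Counter(label for label in review.values() if label != "ok")
print(counts.most_common())
# Retrieval failures point at tuning and synonyms, synthesis failures at the
# generated answer, coverage gaps at knowledge authoring.
```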

 

 

The Deeper Point

I opened by asking how you know if your search is working. The honest answer is you don't know with certainty. You can't.

What you can do is build systems that make your uncertainty explicit and your learning ongoing. A test collection isn't proof of correctness; it's a way to surface blind spots and force honest conversations about what "working" even means for your organization.

The teams that I see struggle the most are the ones who treat evaluation as a gate to pass rather than a practice. They validate once, ship, and move on until something breaks visibly enough to demand attention.

Search is a living system. Content changes, user needs shift, and the questions people ask evolve. Conversational search, for instance, is a whole new experience for users, and they will engage with it in interesting and unexpected ways.

There's one more thing worth sitting with. Search systems shape what users search for. If your search is bad at certain queries, users learn to avoid them. They adapt their language to what works. Over time, your query logs start to reflect what the system can do, not what users need. This is the feedback loop problem and it's invisible in your metrics. The queries that would reveal your biggest gaps never get asked, because users have already given up on them. This is why you can't purely rely on behavioral data. Sometimes you must go talk to people. Ask what they couldn't find, not just observe what they searched for.

 

If you're starting from zero, pull 100 real queries from sys_search_signal_result_event. Label them for relevance. Split them 50/50. Measure your baseline. Then change one thing, measure again, and see what moves.
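Putting the pieces together, here is a sketch of that measure, change, measure-again loop. It reuses the dev_set structure from the split sketch above; run_search is a hypothetical stand-in for however you call your search (for example, built on the Table API pull shown earlier).

```python
def mean_precision_at_5(collection, run_search):
    """Average precision@5 across a labeled query set.
    run_search(query) should return a ranked list of document IDs."""
    scores = []
    for item in collection:
        retrieved = run_search(item["query"])[:5]
        hits = sum(1 for doc in retrieved if doc in item["relevant"])
        scores.append(hits / 5)
    return sum(scores) / len(scores)

# baseline = mean_precision_at_5(dev_set, run_search)
# ... apply exactly one change (synonym list, tuning config, article revision) ...
# after = mean_precision_at_5(dev_set, run_search)
# print(f"precision@5: {baseline:.2f} -> {after:.2f}")
```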

 

That's not everything. But it's a foundation you can build on.

 

 
