Your MCP Server is Live — Now Make It Actually Work

BSHarshaK

A practical framework for ServiceNow AI architects who want their MCP agents to reason correctly in production — not just connect.

The failure nobody puts in the demo

The demo worked perfectly. Two tools, one test prompt, clean result. You showed the team, marked the PoC complete, and started planning production rollout. Three weeks later the agent is failing. Not occasionally — consistently. Confidently picking the wrong tool, passing the wrong values, creating records nobody asked for.

You check the integration. You check the Skill. You check the OAuth flow. Everything is fine.

The problem was four words in a tool description.

Working infrastructure is the starting line, not the finish line. The layer that determines whether your agent reasons correctly in production — under real user language, with multiple tools, at scale — lives entirely in the text you write around that infrastructure. This article is about that layer.

What Context Engineering is

Context Engineering is the discipline of designing the information you give an AI agent so it can reason correctly — not just execute.

Every piece of text the agent sees — tool descriptions, input schemas, agent roles, Skill instructions — is an instruction to the model. The model reasons over all of it simultaneously. If that text is ambiguous, overlapping, or incomplete, the agent will make confident, plausible, wrong decisions.

In the MCP context specifically, this matters for one reason: the agent selects tools entirely based on their descriptions. It cannot inspect the underlying Skill logic. It cannot see the GlideRecord script. It reads the description you wrote and makes a bet. If your description is imprecise, your agent is imprecise — regardless of how solid the underlying implementation is.
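To make that concrete, here is a minimal sketch of what a client receives for one tool when it lists the tools on a server. The field names (name, description, inputSchema) follow the MCP specification; the tool name and values are hypothetical and not taken from a real ServiceNow instance.

```python
import json

# Hypothetical tool definition, shaped like one entry in an MCP tools/list result.
# Field names follow the MCP specification; the values are illustrative only.
create_incident_tool = {
    "name": "create_incident",
    "description": "Use this tool to create a new incident in ServiceNow.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "short_description": {"type": "string"},
            "priority": {"type": "integer", "minimum": 1, "maximum": 5},
        },
        "required": ["short_description"],
    },
}


def model_visible_text(tool: dict) -> str:
    """Return everything the model can reason over for this tool."""
    return json.dumps(tool, indent=2)


print(model_visible_text(create_incident_tool))
```

Nothing about the Skill logic or the script behind the tool appears in that output. Whatever the agent needs in order to choose correctly has to be in these three fields.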

⚙️ The Three Text Layers

In ServiceNow's MCP implementation, the text that governs agent behaviour lives in three separate places, each doing a different job.

Layer	Where It Lives	What It Controls
Tool description	Tool record in MCP Server Console	Tool selection — which tool the agent calls
Skill instructions	Now Assist Skill record	Skill execution — how the tool performs its action
Agent role / system prompt	Agent configuration	Agent posture — overall reasoning style and cross-tool behaviour

The five tests below apply specifically to the Tool description layer — the description field on the Tool record. This is the text the agent reads when deciding which tool to call. Execution logic belongs in Skill instructions. Cross-tool behavioural guidance belongs in the agent role. Putting logic in the wrong layer creates conflicts that are hard to debug because everything appears to be working correctly in isolation.

🔺 The Five Tests

Before any MCP tool goes to production, run its description through these five tests.

Test 1: The Disambiguation Test

Question: Could the agent tell this tool apart from every other tool on the server — including under ambiguous user language?

Two tools whose descriptions share key trigger words will cause the agent to guess whenever user input lands in the overlap. A Create Incident tool and an Update Incident tool that both use the word "open" as a trigger will occasionally conflict. The agent picks confidently. It picks wrong.

The fix: Add explicit negative scope. Do not just describe what the tool does — describe what it does not do, specifically in relation to the other tools on the same server.

"Do NOT use this tool if the user references an existing incident number (INC...) or asks to modify, update, resolve, or close a record that already exists."

With one tool on the server, this clause is harmless. The moment you add a second, it is essential.

Multi-tool note: Disambiguation compounds with every tool you add. A standard ITSM MCP implementation — Create Incident, Update Incident, Search Incidents, Create Change, Create Service Request — requires each tool to explicitly rule out the others' trigger territory. Write the negative scope for every tool as a set, not individually. When you add a new tool, revisit the existing descriptions.
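One way to review descriptions as a set is to look for trigger words that two tools share. The sketch below is a rough heuristic, not a ServiceNow feature: the descriptions and the stopword list are hypothetical, and a shared word is a prompt to add negative scope, not proof of a conflict.

```python
import re
from itertools import combinations

# Hypothetical descriptions; replace with the description fields
# from your own Tool records.
descriptions = {
    "Create Incident": "Use when the user wants to open or log a new incident.",
    "Update Incident": "Use when the user wants to update or close an open incident.",
}

# Words too generic to count as trigger words (an assumption; tune per server).
STOPWORDS = {"use", "when", "the", "user", "wants", "to", "or", "a", "an", "incident"}


def trigger_words(text: str) -> set:
    """Lowercase the text and keep only words that are not stopwords."""
    words = re.findall(r"[a-z]+", text.lower())
    return {w for w in words if w not in STOPWORDS}


def overlaps(tools: dict) -> dict:
    """Map each pair of tool names to the trigger words they share."""
    result = {}
    for (name_a, desc_a), (name_b, desc_b) in combinations(tools.items(), 2):
        shared = trigger_words(desc_a) & trigger_words(desc_b)
        if shared:
            result[(name_a, name_b)] = shared
    return result


for pair, shared in overlaps(descriptions).items():
    print(pair, sorted(shared))
```

For these two descriptions the check flags the word "open", which is exactly the overlap described above. Rerun it every time a tool is added to the server.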

Test 2: The Enumeration Test

Question: For every enumerated value in your schema, does the description tell the agent exactly what to pass — including the semantic mapping?

A schema that says priority (1-5) is incomplete. The agent knows the range but not the meaning. When a user says "this is critical" or "low priority, whenever you can", the agent needs to map natural language to an integer. Without the mapping, different models will map differently, and the same model will be inconsistent across sessions.

The fix: Always spell it out.

"priority: 1=critical, 2=high, 3=moderate, 4=low, 5=planning"

The same logic applies to impact, urgency, and any numeric scale with semantics beyond its range. If a human would need to look up what "3" means, so does the agent.
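The enumeration test is easy to check mechanically. The sketch below assumes mappings are written as "N=label" on the same line as the field name, which is a convention of this article and not a platform requirement. The helper name missing_mappings is hypothetical.

```python
import re


def missing_mappings(description: str, field: str, low: int, high: int) -> list:
    """Return the values of `field` in [low, high] that have no 'N=label' mapping.

    Only the rest of the line after the first occurrence of the field name
    is inspected, so each field should be documented on its own line.
    """
    match = re.search(rf"{re.escape(field)}\s*:?(.*)", description)
    segment = match.group(1) if match else ""
    mapped = {int(n) for n in re.findall(r"(\d+)\s*=\s*\w+", segment)}
    return [value for value in range(low, high + 1) if value not in mapped]


before = "Inputs: short_description (required), priority (1-5), category"
after = (
    "- priority: 1=critical, 2=high, 3=moderate, 4=low, 5=planning\n"
    "- impact: 1=high, 2=medium"
)

print(missing_mappings(before, "priority", 1, 5))  # every value is unmapped
print(missing_mappings(after, "priority", 1, 5))   # nothing is missing
print(missing_mappings(after, "impact", 1, 3))     # the mapping for 3 is missing
```

A non-empty result means the agent has to guess what that integer means.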

Test 3: The Default Behavior Test

Question: When optional inputs are absent, does the agent know what will happen — and is that safe for your production instance?

Two failure modes exist here. The first: undocumented defaults, where the agent omits an optional field and a platform default creates a record nobody intended. The second: fields that are optional in the schema but mandatory on your instance — the insert fails silently and the error is hard to trace.
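A quick way to spot the first failure mode is to list every optional field whose line in the description states neither a default nor an instruction to ask the user. This is a minimal sketch under assumptions: the schema and description are hypothetical, and the two cue phrases ("default", "ask the user") are simply the wording used in this article.

```python
def undocumented_optionals(schema: dict, description: str) -> list:
    """Return optional fields with no documented default or 'ask the user' rule."""
    required = set(schema.get("required", []))
    optional = [name for name in schema.get("properties", {}) if name not in required]
    lines = description.lower().splitlines()
    flagged = []
    for name in optional:
        # Collect the description lines that mention this field.
        field_lines = [line for line in lines if name.lower() in line]
        documented = any(
            "default" in line or "ask the user" in line for line in field_lines
        )
        if not documented:
            flagged.append(name)
    return flagged


# Hypothetical input schema and description for illustration.
schema = {
    "properties": {
        "short_description": {},
        "priority": {},
        "category": {},
        "impact": {},
    },
    "required": ["short_description"],
}
description = """Required: short_description (one-line summary).
- priority: 1=critical, 2=high, 3=moderate, 4=low, 5=planning (default: 3)
- category: ask the user if not specified
- impact: 1=high, 2=medium, 3=low"""

print(undocumented_optionals(schema, description))
```

Here only impact is flagged. Whether an undocumented default is actually risky still depends on your instance, so treat the output as a review list.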

The fix: Document defaults explicitly. Flag fields where absence is risky. What is safe in a sandbox is not always safe in production.

"Category: ask the user if not specified — do not infer or leave blank, as this field may be mandatory on the target instance."

Test 4: The Output Contract Test

Question: Does the tool description tell the agent what it will receive back — and what to do with it?

Output behavior is usually handled in the agent role. The problem with that: tool descriptions should be self-contained. If the tool is reused in a different agent, or if the agent role is updated without revisiting every tool, the output contract becomes undefined. Agents that receive structured output without instructions will paraphrase, summarize, or decide on their own whether to surface it — inconsistently.

The fix: Add one line covering both parts of the contract.

"On success, confirm to the user with the incident number and deep-link URL. On error, surface the error reason verbatim."

Test 5: The Failure Mode Test

Question: Does the agent know what failure looks like — and that it should not try to recover by fabricating inputs?

Skills that return structured error objects are doing the right thing. But if the tool description does not tell the agent what failure looks like, some models will treat an error object as a partial success and attempt to fill in missing inputs themselves.
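The behavior you want from the agent can be written down as plain code, which helps when deciding what the description has to say. The result shapes below are assumptions made for this sketch (an "error" object with a "reason" field on failure, "number" and "url" on success). Check what your own Skill actually returns.

```python
def user_message(result: dict) -> str:
    """Turn a tool result into the message the user should see."""
    if "error" in result:
        # Failure: surface the reason exactly as returned.
        # No retry and no guessed inputs.
        return f"The incident was not created. Reason: {result['error']['reason']}"
    # Success: confirm with the record number and its link.
    return f"Created {result['number']}: {result['url']}"


# Hypothetical results for illustration only.
success = {
    "number": "INC0010001",
    "url": "https://instance.example/incident/INC0010001",
}
failure = {"error": {"reason": "Category is mandatory."}}

print(user_message(success))
print(user_message(failure))
```

The model is not running this function, so the tool description has to carry the same two branches in words.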

The fix:

"If the tool returns an error object, surface the reason field to the user verbatim. Do not retry with inferred or fabricated values."

THE DEDOF FRAMEWORK

Five tests. Together they form the DEDOF framework.

Run every MCP tool description through these before you ship.

D — Disambiguation: can the agent tell this tool apart from every other?

E — Enumeration: does every value have a semantic mapping?

D — Default Behavior: does the agent know what happens when inputs are absent?

O — Output Contract: does the agent know what it receives back — and what to do with it?

F — Failure Mode: does the agent know what failure looks like — and not to fabricate a recovery?

✅ Quick validation on a live example

Here is a typical Create Incident tool description — functional, reasonable, and not quite production-ready:

"Use this tool to create a new incident in ServiceNow. Inputs: short_description (required), priority (1-5), category, urgency (1-3), impact (1-3). Returns the incident number and a link to the record."

In ServiceNow, that description lives in the Tool record under MCP Server Console — not in the Now Assist Skill itself.

To set up the Now Assist Skill and MCP Tool used in this example, follow the runbook "Implementing the Model Context Protocol in ServiceNow — A Practical Guide".

Run it through the five tests:

Test	Result
Disambiguation	Passes for a single-tool server. Needs an explicit INC-number boundary the moment Update Incident is added.
Enumeration	Priority has a range but no semantic mapping. Impact and urgency have the same gap — 1=high, 2=medium, 3=low is missing for both.
Default behavior	Category default is not documented — on instances where category is mandatory, this causes silent insert failures.
Output contract	Return values are listed. What the agent should do with them is absent — that instruction belongs here, not only in the agent role.
Failure mode	No instruction on what to do if the tool returns an error object. Some models will attempt to recover by fabricating missing inputs.

Four of five tests surface something worth tightening. That is typical. Even carefully written descriptions leave reasoning gaps that only appear under production conditions.

The revised tool description

Here is what the same Create Incident description looks like after applying the five tests:

Creates a net-new Incident record in ServiceNow.
 
Use ONLY when the user wants to log, create, or report a new incident.
Do NOT use if the user references an existing incident number (INC...)
or asks to update, modify, resolve, or close an existing record.
 
Required: short_description (one-line summary).
 
Optional:
- description: free-form detail
- priority: 1=critical, 2=high, 3=moderate, 4=low, 5=planning (default: 3)
- category: ask the user if not specified — do not infer or leave blank
- impact: 1=high, 2=medium, 3=low
- urgency: 1=high, 2=medium, 3=low
 
On success: confirm to the user with the incident number and deep-link URL.
On error: surface the reason verbatim. Do not retry or fabricate inputs.

Same infrastructure. Materially better reasoning surface.

The practical exercise

If you have a working MCP server with at least one published tool, open the Tool record, pull the description field, and run the five tests right now.

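Test 2 will usually surface something immediately. Test 1 passes trivially on a single-tool server and starts to bite once a second tool shares it. Test 3 depends on your instance configuration. Tests 4 and 5 tend to have at least one gap each.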

Fix those before you add a second tool. The disambiguation problem is harmless with one tool and compounds with every tool you add after it.
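If you want a first pass before reading the description line by line, a crude cue check can point at the tests most likely to fail. The patterns below are assumptions chosen to match the wording used in this article, not an official rule set. A missing cue means "review this test", not "this test failed".

```python
import re

# One textual cue per DEDOF test (hypothetical patterns; adapt to your wording).
CUES = {
    "Disambiguation": r"do not use",
    "Enumeration": r"\d\s*=\s*\w+",
    "Default behavior": r"default|ask the user",
    "Output contract": r"on success",
    "Failure mode": r"on error|error object",
}


def dedof_report(description: str) -> dict:
    """Return, per test, whether its cue appears anywhere in the description."""
    text = description.lower()
    return {test: bool(re.search(pattern, text)) for test, pattern in CUES.items()}


before = (
    "Use this tool to create a new incident in ServiceNow. Inputs: "
    "short_description (required), priority (1-5), category, urgency (1-3), "
    "impact (1-3). Returns the incident number and a link to the record."
)
after = (
    "Do NOT use if the user references an existing incident number.\n"
    "priority: 1=critical, 2=high (default: 3)\n"
    "On success: confirm with the incident number.\n"
    "On error: surface the reason verbatim."
)

for label, text in (("before", before), ("after", after)):
    print(label, dedof_report(text))
```

The original description has none of the cues and the shortened revised one has all five. A cue being present does not mean the clause is correct, so the manual read still matters.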

💡 Three takeaways

Working infrastructure is the starting line, not the finish line — and Context Engineering is what closes the gap. A tool that executes is not the same as a tool an agent reasons over reliably. The gap lives in the text you write, not the code you deploy.
Tool descriptions are the agent's entire decision surface for tool selection. Write them as if they are the only thing standing between correct behavior and a confident wrong answer — because they are.
Run the five tests before you go to production. Disambiguation, enumeration, default behavior, output contract, failure modes. Five questions, fifteen minutes, one fewer class of production failures.

BEFORE YOU SHIP

Before you ship your next MCP tool — ask yourself: did you DEDOF it?

Builds on Yogesh Shinde's "Implementing the Model Context Protocol in ServiceNow — A Practical Guide" and Vyoma Gajjar's "MCP: The Protocol Powering Agentic AI".