Behind the AI Agent
Prompts, Data & Control
← Back to Masterclass Overview
Key Takeaways
- Context Engineering: Providing the right information at the right time – too much context = noise, too little = hallucinations
- Memory Types: Short-term (conversation), Long-term (preferences), and Episodic (specific events) memory
- Golden Datasets: Systematic evaluation with representative test cases and expected outcomes
- Metrics First: Define KPIs BEFORE building, not after launch – both business and quality metrics
From "Prompt Guessing" to "Agent Engineering"
Many teams treat AI agent development like guesswork – trying random prompts, hoping for good results, with no systematic way to measure or improve. This session changes that.
| Prompt Guessing | Agent Engineering |
|---|---|
| Trial-and-error prompting | Systematic engineering |
| No defined metrics | Defined metrics upfront |
| Can't prove value | Structured testing |
| Don't know when to stop | Continuous optimization |
Key Insight: "We need to move away from guessing at prompts to systematic engineering of AI agents. Without metrics, you can't prove value or know when to stop."
The AI Agent Factory Framework
Think of agent development as a factory with distinct stations. Each station has specific inputs, processes, and outputs:
Context Engineering
Context Engineering is the art of providing the AI agent with exactly the right information at the right time to make informed decisions.
The Challenge: LLMs have limited context windows. Too much context creates noise and slows responses; too little leads to poor decisions and hallucinations. The solution is to dynamically assemble only the relevant context – one way to do this is sketched after the list below.
- Filter aggressively: Only include information relevant to the current task
- Prioritize by relevance: Most important context first, within token limits
- Use structured formats: JSON, tables, or clear sections help the LLM parse
- Test context combinations: Different contexts produce different results
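A minimal sketch of dynamic context assembly, assuming a simple relevance-scored retrieval setup. The snippet type, threshold, and token heuristic are illustrative assumptions for the example, not a ServiceNow API.

```python
# Illustrative only: filter, prioritize, and assemble context within a token budget.
from dataclasses import dataclass

@dataclass
class ContextSnippet:
    text: str         # candidate piece of context
    relevance: float  # 0..1, e.g. from a retrieval / similarity score

def assemble_context(snippets: list[ContextSnippet],
                     max_tokens: int = 1500,
                     min_relevance: float = 0.4) -> str:
    """Filter aggressively, order by relevance, stop at the token budget."""
    # 1. Filter: drop anything below the relevance threshold
    relevant = [s for s in snippets if s.relevance >= min_relevance]
    # 2. Prioritize: most relevant snippets first
    relevant.sort(key=lambda s: s.relevance, reverse=True)

    # 3. Assemble within the budget (rough "1 token ~ 4 characters" heuristic)
    selected, used = [], 0
    for s in relevant:
        cost = len(s.text) // 4 + 1
        if used + cost > max_tokens:
            break
        selected.append(s.text)
        used += cost

    # 4. Structured format: clear sections help the LLM parse the context
    return "\n\n".join(f"## Context {i + 1}\n{text}"
                       for i, text in enumerate(selected))
```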
Memory Types in ServiceNow
ServiceNow provides three types of memory for AI Agents, each serving a different purpose (a simple illustration follows the list):
- Short-term memory: the current conversation
- Long-term memory: user preferences
- Episodic memory: specific events the agent can recall later
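To make the distinction concrete, here is a small illustrative structure for the three memory scopes. The field names and example values are assumptions made for the sketch; they do not reflect ServiceNow's internal data model.

```python
# Illustrative sketch of the three memory scopes (not the ServiceNow data model).
from dataclasses import dataclass, field

@dataclass
class AgentMemory:
    # Short-term: the current conversation turns
    short_term: list[str] = field(default_factory=list)
    # Long-term: durable user preferences that persist across sessions
    long_term: dict[str, str] = field(default_factory=dict)
    # Episodic: specific past events the agent can recall later
    episodic: list[dict] = field(default_factory=list)

memory = AgentMemory()
memory.short_term.append("User: My VPN keeps disconnecting.")
memory.long_term["preferred_language"] = "en"
memory.episodic.append({"event": "incident_created",
                        "number": "INC0012345",
                        "summary": "VPN drops repeatedly"})
```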
Metrics: Business vs. Quality
Successful AI agents require both business and quality metrics – and they must be defined BEFORE you start building:
| Business Metrics | Quality Metrics |
|---|---|
| ROI / Cost Savings | Accuracy / Precision |
| Time Saved | Response Time |
| User Satisfaction (CSAT) | Error Rate |
| Deflection Rate | Completion Rate |
| Tickets Resolved | Hallucination Rate |
Critical Rule: "KPIs must be defined BEFORE building, not after launch. If you don't know what success looks like, you can't measure it."
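One lightweight way to honor that rule is to write the KPIs down as data before any building starts. The metric names, targets, and the lower-is-better flag below are illustrative assumptions, not prescribed values.

```python
# Hedged sketch: pin down KPIs as data BEFORE building the agent.
KPIS = {
    "business": {
        "deflection_rate":     {"target": 0.30, "unit": "ratio"},
        "time_saved_per_case": {"target": 5.0,  "unit": "minutes"},
        "csat":                {"target": 4.2,  "unit": "score_1_to_5"},
    },
    "quality": {
        "accuracy":            {"target": 0.90, "unit": "ratio"},
        "hallucination_rate":  {"target": 0.02, "unit": "ratio", "direction": "lower_is_better"},
        "p95_response_time":   {"target": 3.0,  "unit": "seconds", "direction": "lower_is_better"},
    },
}

def meets_target(category: str, name: str, measured: float) -> bool:
    """Compare a measured value against the pre-defined target."""
    kpi = KPIS[category][name]
    if kpi.get("direction") == "lower_is_better":
        return measured <= kpi["target"]
    return measured >= kpi["target"]
```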
Evaluation with Golden Datasets
Golden Datasets are curated collections of test cases with known expected outcomes. They're essential for systematic agent evaluation:
| Step | Action |
|---|---|
| 1 | Create Golden Dataset – Representative test cases with expected outcomes |
| 2 | Select Agent – Choose which agent configuration to evaluate |
| 3 | Define Metrics – Accuracy, Relevance, Helpfulness, Safety |
| 4 | Run Evaluation – Execute all test cases systematically |
| 5 | Analyze Results – Identify patterns, failures, opportunities |
| 6 | Iterate – Improve agent, re-test, compare results |
Pro Tip: Your Golden Dataset should include edge cases, not just happy paths. Include examples where you expect the agent to fail gracefully, ask clarifying questions, or escalate to humans.
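As a minimal sketch, here is what a small Golden Dataset and evaluation loop could look like. The case structure, the hypothetical run_agent stand-in, and the scoring are simplified assumptions, not ServiceNow's Agentic Evaluations feature.

```python
# Illustrative golden dataset: happy paths plus edge cases that should
# trigger a clarifying question or escalation.
GOLDEN_DATASET = [
    {"input": "Reset my email password",
     "expected": {"action": "password_reset"}},
    # Edge case: ambiguous request -> agent should ask a clarifying question
    {"input": "It is broken",
     "expected": {"action": "clarify"}},
    # Edge case: out of scope -> agent should escalate to a human
    {"input": "I want to dispute my salary",
     "expected": {"action": "escalate"}},
]

def run_agent(prompt: str) -> dict:
    """Stand-in for the agent under test; replace with a real agent call."""
    return {"action": "clarify"}

def evaluate(dataset: list[dict]) -> float:
    """Run every test case and report the share that matched expectations."""
    passed = 0
    for case in dataset:
        result = run_agent(case["input"])
        if result.get("action") == case["expected"]["action"]:
            passed += 1
    return passed / len(dataset)

print(f"Pass rate: {evaluate(GOLDEN_DATASET):.0%}")
```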
Knowledge Graphs: Connected Context
Knowledge Graphs represent connected knowledge as nodes and edges, showing relationships between entities. This enables context-aware queries that go beyond simple keyword matching:
[User] --has_role--> [Role: Developer]
[User] --uses--> [System: ServiceNow]
[User] --reported--> [Incident: INC0012345]
Benefits for AI Agents:
- Better understanding of relationships between data
- More precise, contextually grounded answers
- Fewer hallucinations due to explicit relationship constraints
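As a small illustration of querying connected context, here is a sketch using the networkx library – an assumption made purely for the example, not how ServiceNow stores or queries its knowledge graph. Node names mirror the diagram above.

```python
# Illustrative knowledge graph: nodes are entities, edges carry relationships.
import networkx as nx

g = nx.DiGraph()
g.add_edge("User", "Role: Developer", relation="has_role")
g.add_edge("User", "System: ServiceNow", relation="uses")
g.add_edge("User", "Incident: INC0012345", relation="reported")

# Context-aware query: follow explicit relationships from the user node
# instead of matching keywords across flat documents.
for _, target, data in g.out_edges("User", data=True):
    print(f"{data['relation']:>10} -> {target}")
```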
Next Steps
- Define Your Metrics: Before building any agent, define what success looks like
- Create a Golden Dataset: Start with 20-30 representative test cases for your use case
- Explore Agentic Evaluations: Check out the resources below to set up systematic testing
- Complete the Journey: Join us for Session 5 on Data, Scale & Governance
Last updated: January 2026
