Frequently asked questions about agentic evaluations
Find answers to common questions about setting up and running evaluations.
- Do I need to keep anything ready before an automated evaluation?
- Before you begin, make sure you:
- Test your agent or workflow in the playground. Catch the obvious issues early—automated evaluations are best for deeper validation.
- Ensure your table has all the required inputs if you're generating test scenarios or using scenarios from previous agent or workflow runs during setup.
- Prep enough scenarios. We recommend at least 100. Your evaluation is only as strong as the situations you put your agent through.
- Define what success means. Be clear on what the right output for your agent should be.
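The checklist above can be turned into a quick pre-flight check on your scenarios before you start an evaluation. The following is a minimal sketch in Python, not ServiceNow script and not part of the product. The function name `check_readiness`, the `ReadinessReport` type, and the field names `short_description` and `category` are hypothetical and only illustrate the idea of counting scenarios and spotting missing inputs.

```python
from dataclasses import dataclass

MIN_SCENARIOS = 100  # the recommended minimum from the checklist above


@dataclass
class ReadinessReport:
    scenario_count: int
    enough_scenarios: bool
    incomplete_rows: list[int]  # indexes of scenarios missing a required input


def check_readiness(scenarios, required_inputs, min_scenarios=MIN_SCENARIOS):
    """Count scenarios and flag any that lack a required input (hypothetical helper)."""
    incomplete = [
        index
        for index, row in enumerate(scenarios)
        # A field counts as missing if it is absent, None, or an empty string.
        if any(row.get(field) in (None, "") for field in required_inputs)
    ]
    return ReadinessReport(
        scenario_count=len(scenarios),
        enough_scenarios=len(scenarios) >= min_scenarios,
        incomplete_rows=incomplete,
    )


# Hypothetical scenarios; in practice these would come from your own table.
scenarios = [
    {"short_description": f"Issue {i}", "category": "network"} for i in range(120)
]
scenarios[5]["category"] = ""  # simulate one scenario with a missing input

report = check_readiness(scenarios, ["short_description", "category"])
print(report)
```

A report that shows fewer than 100 scenarios, or any incomplete rows, suggests fixing the data first so the evaluation reflects the agent's behavior and not gaps in the inputs.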
- How do I set up my first automated evaluation?
- To set up an evaluation, follow the guided flow:
- Select your agent or workflow and its version.
- Choose your metrics—built-in or custom.
- Use an existing dataset or decide how you want to build one.
- When should I create a custom metric?
- Create a custom metric when you have unique evaluation criteria and want to measure workflow-specific or agent-specific behaviors that aren't covered by ServiceNow's built-in metrics. For example, you might want to:
- Check whether a particular phrase appears in the agent's response.
- Measure response length to assess verbosity or brevity.
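To make the two examples above concrete, here is a minimal sketch of the logic such metrics might implement. It is written in Python for illustration only; it is not the ServiceNow scripting API, and the function names, word limits, and sample response are all assumptions.

```python
def contains_phrase(response: str, phrase: str) -> bool:
    """Return True if the phrase appears in the response, ignoring case."""
    return phrase.casefold() in response.casefold()


def response_length_in_words(response: str) -> int:
    """Count whitespace-separated words as a simple proxy for length."""
    return len(response.split())


def verbosity_label(response: str, min_words: int = 20, max_words: int = 150) -> str:
    """Label a response by length; the word limits are illustrative assumptions."""
    count = response_length_in_words(response)
    if count < min_words:
        return "too brief"
    if count > max_words:
        return "too verbose"
    return "acceptable"


# Hypothetical agent response used only to exercise the functions.
response = (
    "I have reset your password. Please sign in again "
    "and let me know if the issue persists."
)

print(contains_phrase(response, "Reset your password"))
print(response_length_in_words(response))
print(verbosity_label(response, min_words=10))
```

Whatever form your real metric takes, it helps to decide up front what counts as a match (for example, whether case matters) and what length range you consider acceptable.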
- How do I build a dataset for agentic evaluations?
- There are two ways to build a dataset for agentic evaluations, but first, let's clarify what a dataset is. Your dataset should include logs of executions that capture what happens when your AI agent or workflow processes records like incidents, cases, or tasks. You can create a dataset by either:
- Using logs from previous agent or workflow runs, or
- Generating new logs by running the agent or workflow after setup.
- What's next after an automated evaluation?
- Review your evaluation results to:
- Identify configuration gaps in your agent or workflow.
- Assess deployment readiness.
- Analyze tool performance for issues with inputs or descriptions.
- Drill down into individual executions and metric scores.
- How do I create a custom metric?
- Create a custom metric in a few steps:
- Name and describe your metric.
- Define its evaluation scope—agentic workflow, agents, or both.
- Specify what it measures, how it works, and its output format.
- Add metric inputs and write your script-based metric.
- Save and publish to make it available for use.
- How do I interpret evaluation results?
- Based on the metrics you select, each execution will display a score for every metric. Refer to the "Metric guide" to understand what the scores mean. You can also customize metric thresholds to align with your organization's definitions of success and failure.
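As an illustration of how thresholds turn per-execution scores into a success or failure view, here is a minimal Python sketch. It assumes numeric scores where higher is better and a single cutoff per metric; the real score scales are defined in the Metric guide and may differ. The metric names `metric_a` and `metric_b`, the scores, and the thresholds are hypothetical.

```python
from collections import defaultdict


def pass_rates(executions, thresholds):
    """Return the share of executions meeting each metric's threshold.

    Assumes higher scores are better and that every metric seen in an
    execution has an entry in `thresholds`.
    """
    passed = defaultdict(int)
    total = defaultdict(int)
    for scores in executions:
        for metric, score in scores.items():
            total[metric] += 1
            if score >= thresholds[metric]:
                passed[metric] += 1
    return {metric: passed[metric] / total[metric] for metric in total}


# Hypothetical scores: one dictionary per execution, one entry per metric.
executions = [
    {"metric_a": 0.9, "metric_b": 0.4},
    {"metric_a": 0.7, "metric_b": 0.8},
    {"metric_a": 0.95, "metric_b": 0.6},
    {"metric_a": 0.5, "metric_b": 0.9},
]
# Hypothetical thresholds reflecting one organization's definition of success.
thresholds = {"metric_a": 0.8, "metric_b": 0.5}

rates = pass_rates(executions, thresholds)
print(rates)
```

Raising or lowering a threshold changes which executions count as failures, so it is worth agreeing on these values before comparing results across evaluation runs.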
- How do I track the progress of my evaluations?
- Evaluations may take some time, but you don't need to stay on the page. From the homepage, you can track all evaluations and even see if any action is required.
- How is the parser tool used during custom metric creation?
- When creating a custom metric for agentic evaluations, providing a metric input is optional—we include the 'execution plan record sys_id' by default. We also provide a parser tool that pulls structured data from your execution logs, so you won't need to manually parse through the XML or JSON. You can access the parser tool's outputs with tool output.