Explore agentic evaluations

Release version: Zurich

Updated March 18, 2026

Automated evaluations test your agentic AI assets and help determine when they're ready for production. Learn more about how evaluations work, who they’re designed for, and the benefits they deliver.

Agentic evaluations overview

Automated agentic evaluations help AI agent builders deploy with confidence by providing objective, explainable evidence that their agents are ready for production. They remove the guesswork from quality assurance by running your agent against a defined dataset and applying LLM-powered judges to score quality metrics such as task completeness, response accuracy, and tool use. From there, the system generates recommended optimizations that you can apply before triggering a re-evaluation to confirm the improvements.
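The scoring idea described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: `Example`, `evaluate`, `toy_agent`, and the two judge functions are hypothetical names, not part of the product, and the judges are plain deterministic functions standing in for LLM-powered judges.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable


@dataclass
class Example:
    """One dataset row: an input prompt and the expected answer (hypothetical shape)."""
    prompt: str
    expected: str


# A judge maps (example, response) to a score between 0 and 1.
# A real judge would be an LLM call; these are simple stand-ins.
Judge = Callable[[Example, str], float]


def exact_match_judge(example: Example, response: str) -> float:
    """Stand-in for a response-accuracy judge: case-insensitive exact match."""
    return 1.0 if response.strip().lower() == example.expected.strip().lower() else 0.0


def non_empty_judge(example: Example, response: str) -> float:
    """Stand-in for a task-completeness judge: did the agent answer at all?"""
    return 1.0 if response.strip() else 0.0


def evaluate(
    agent: Callable[[str], str],
    dataset: list[Example],
    judges: dict[str, Judge],
) -> dict[str, float]:
    """Run the agent on every example and average each judge's scores."""
    scores: dict[str, list[float]] = {name: [] for name in judges}
    for example in dataset:
        response = agent(example.prompt)
        for name, judge in judges.items():
            scores[name].append(judge(example, response))
    return {name: mean(values) for name, values in scores.items()}


def toy_agent(prompt: str) -> str:
    """Hypothetical agent that can answer only password questions."""
    return "Reset it in the portal" if "password" in prompt else ""


dataset = [
    Example("How do I reset my password?", "reset it in the portal"),
    Example("How do I order a laptop?", "use the service catalog"),
]
judges = {"response_accuracy": exact_match_judge, "task_completeness": non_empty_judge}
result = evaluate(toy_agent, dataset, judges)
print(result)
```

Keeping each metric as a separate judge function mirrors the idea of scoring several quality dimensions independently, so a low score can be attributed to a specific metric rather than to a single blended number.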

Building agentic AI assets like AI agents and agentic workflows is an iterative process. Agentic evaluations are designed to verify the quality of the AI asset in a structured way, which helps speed up that process. Because you're testing against representative datasets, you can be more confident that your agentic AI asset can handle real-world situations.

Agentic evaluations can be run in non-production environments and don't require live deployment. They can be run during testing phases of agentic AI assets to help ensure that they can be deployed to a production environment while meeting your benchmarks and standards.

Agentic evaluations users

Table 1. Users
User	Description
Agent builders	Developers or configurators who build agents in AI Agent Studio. Automated evaluations are designed so agent builders can run rigorous evaluations at scale.
Platform administrators	Administrators who govern which agents are approved for production. They can use automated evaluation results as evidence of quality before deployment.
AI leads and architects	AI leads and architects can use automated evaluation results for audit trails and quality metrics across multiple agents.

Automated evaluations workflow

1. Configure an evaluation run with a name, the selected agentic AI asset and its version, metrics, and a dataset.
2. Execute the run and track progress as the LLM judges assess the agentic responses.
3. Analyze the run results, including the judge scores, identified issues, and traces.
4. Optimize the agentic AI asset with targeted recommendations, then trigger re-evaluations.
5. Validate the quality of future runs or other changes to the agentic AI asset.
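The workflow steps above can be sketched as a small loop: configure a run, execute it, check the scores, and re-evaluate a new version until the results meet a threshold. This is a hedged illustration only. `RunConfig`, `execute_run`, `failing_metrics`, and `evaluate_until_ready` are hypothetical names, the threshold is an assumed pass criterion, and `execute_run` is a placeholder whose scores rise with each version to simulate applied optimizations.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class RunConfig:
    """Hypothetical run configuration: name, asset, version, metrics, dataset."""
    name: str
    asset: str
    asset_version: int
    metrics: tuple[str, ...]
    dataset: str


def execute_run(config: RunConfig) -> dict[str, float]:
    """Placeholder for executing a run.

    Scores rise with the asset version purely to simulate the effect of
    applying optimizations between runs; no agent or judge is called.
    """
    base = min(1.0, 0.6 + 0.15 * (config.asset_version - 1))
    return {metric: round(base, 2) for metric in config.metrics}


def failing_metrics(scores: dict[str, float], threshold: float) -> list[str]:
    """Analyze: list the metrics whose score is below the threshold."""
    return sorted(m for m, s in scores.items() if s < threshold)


def evaluate_until_ready(
    config: RunConfig, threshold: float, max_runs: int = 5
) -> list[tuple[int, dict[str, float]]]:
    """Execute, analyze, optimize, and re-evaluate until all metrics pass."""
    history: list[tuple[int, dict[str, float]]] = []
    for _ in range(max_runs):
        scores = execute_run(config)
        history.append((config.asset_version, scores))
        if not failing_metrics(scores, threshold):
            break
        # Optimize: stand-in for applying recommendations, which yields a new version.
        config = replace(config, asset_version=config.asset_version + 1)
    return history


config = RunConfig(
    name="password-reset-baseline",
    asset="Password reset agent",
    asset_version=1,
    metrics=("task_completeness", "tool_use"),
    dataset="password-reset-cases",
)
history = evaluate_until_ready(config, threshold=0.85)
for version, scores in history:
    print(version, scores)
```

Recording every run in `history` keeps a per-version trail of scores, which is the kind of evidence the workflow produces for later validation.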

Automated evaluations benefits

Table 2. Automated evaluations benefits
Benefit	Feature	Users
Evaluate specific versions of agentic AI assets for quality	Execute an evaluation run	Agent builders
Set your own standards for agentic AI responses and performance	Custom metrics	Agent builders, platform administrators, AI leads and architects
Track evaluations as they progress	In-progress results	Agent builders
Identify issues and trace them back to the source	Evaluation outputs	Agent builders, AI leads and architects
Optimize agentic AI assets based on evaluation results	System-generated optimization recommendations	Agent builders

What to explore next

To learn more about configuring and using agentic evaluations, see: