Evaluation flow for batch evaluations

Release version: Yokohama

Updated September 3, 2025

Summary of Evaluation Flow for Batch Evaluations

Batch evaluation enables ServiceNow Eval administrators to assess up to 100 completed virtual agent conversations simultaneously using a saved query. This process automates the creation of evaluation records and invokes Now Assist skills asynchronously for each eligible conversation, applying the same logic used for single-conversation evaluations but at scale. The flow enforces HR scope exclusions, validates topics and categories, constructs detailed transcripts, excludes early live-agent interactions, and scores conversations through designated skills.

Key Features

Trigger and Inputs: The evaluation starts when an Evaluation Set record’s state changes to “In Progress” with type “Conversation” and includes a query filter targeting conversations.
Random Sampling: The flow executes the query and randomly selects up to 100 conversations, excluding any previously evaluated ones.
HR Scope Exclusion: Conversations within HR application scopes are automatically excluded to maintain compliance.
Transcript Construction: Builds a minute-level transcript by aggregating messages, tagging user and virtual agent texts, and including referenced Knowledge articles and catalog items with specific annotations and truncation rules.
Live Agent Exclusions: Conversations requesting or invoking live agents early in the interaction are excluded from evaluation.
Chat Topic Classification: Uses the Chat Topic Classifier skill to validate evaluation eligibility and determine the conversation’s topic and category (e.g., IT or HR).
Evaluation Record Creation: Creates or updates evaluation records with detailed metadata such as topic, category, transcript references, and first live agent occurrence.
Now Assist Skill Invocation: Asynchronously invokes multiple evaluation skills (e.g., Coherence, Conciseness, Context Retention, Intent Accuracy, Smooth Flow, Truthfulness) per conversation, capturing scores, rationales, and supporting examples in evaluation metrics.
Rate Limiting: Includes a wait period between skill invocations to manage API rate limits.

Practical Outcomes for ServiceNow Customers

Enables efficient, scalable quality assessment of virtual agent conversations, facilitating performance analysis and continuous improvement.
Automates the exclusion of ineligible conversations, ensuring compliance with HR policies and keeping evaluations relevant.
Provides comprehensive transcripts enriched with Knowledge Base and catalog content for accurate evaluation context.
Delivers structured evaluation metrics with detailed scoring and rationale, supporting data-driven decisions on virtual agent effectiveness.
Simplifies management of large conversation datasets through random sampling and duplicate checks, so that evaluation effort focuses on unique, relevant interactions.

Batch evaluation enables Eval admins to evaluate up to 100 completed virtual agent conversations at once, based on a saved query.

Flow name: Execute Batch Evaluation.

The flow creates evaluation records and invokes Now Assist skills for each eligible conversation, mirroring the single-conversation evaluation logic, but at scale. It enforces HR scope exclusions, topic/category validation, transcript construction rules, early live-agent exclusions, and asynchronous scoring through skills.

Batch evaluations are performed using the following logic:

Trigger

Table: Evaluation set [sn_na_conv_eval_evaluation_set]
Condition: State changes to In Progress and Evaluation type = Conversation
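
The trigger condition can be read as a predicate over the previous and current values of an Evaluation Set record. The following is a minimal Python sketch for illustration only, not ServiceNow code: the function name should_trigger and the dictionary keys state and evaluation_type are assumptions, not actual column names.

```python
def should_trigger(previous: dict, current: dict) -> bool:
    """Return True only when the state *changes to* In Progress
    for an evaluation set of type Conversation.

    The keys "state" and "evaluation_type" are hypothetical names.
    """
    state_changed = previous.get("state") != current.get("state")
    return (
        state_changed
        and current.get("state") == "In Progress"
        and current.get("evaluation_type") == "Conversation"
    )
```

Under this reading, updating a record that is already In Progress does not start the flow again, because the state did not change.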

Inputs

Evaluation Set record with:
- Query filter: A query that targets conversations to be evaluated (for example, sys_cs_conversation filters).
- Evaluation type: Conversation
- State: In Progress (to start)
LLM/Skills: Chat Topic Classifier, plus the evaluation skills listed in the buildTranscript outputs.

High-level behavior

Reads the query filter and randomly samples up to 100 conversations.
Skips already-evaluated conversations.
Excludes HR-scoped interactions.
Uses Chat Topic Classifier to validate evaluation eligibility and extracts Topic and Category.
Builds a transcript with controlled inclusion of Knowledge articles and catalog sources, and applies early live agent exclusions.
Creates an Evaluation record and asynchronously invokes all selected evaluation skills, writing scores and rationale to metrics.

Sequence of execution:

Action 1: If the query filter isn’t empty

Purpose: Guard clause.
Logic: Look up the Evaluation Set record and check the query filter field.
If the query filter is present: Proceed to Action 2.
If empty: Stop and optionally log "No query provided".

Action 2: Randomize conversations

Purpose: Select a bounded, random sample of conversations from the provided query.
Logic:
- Execute the query to get matching conversation records.
- Randomly select up to 100 conversations.
  - If more than 100 conversations match, cap the sample at 100.
  - If 100 or fewer match, select all of them.
- Validate the query; if it is invalid, return success = false and an empty or partial array.
Outputs:
- success: true/false
- conversation_ids: array of sys_ids (max 100)
If success = true: Proceed to Action 3; otherwise, stop and log the validation error.
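
The sampling rule above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the flow's implementation: the function name randomize_conversations is hypothetical, the query has already been executed into a list of sys_ids, and a value of None stands in for an invalid query.

```python
import random

MAX_SAMPLE = 100  # cap stated in the flow description


def randomize_conversations(matching_ids, rng=None):
    """Return (success, conversation_ids) with at most MAX_SAMPLE ids.

    matching_ids: sys_ids returned by the query, or None when the
    query could not be validated (an assumption made for this sketch).
    """
    if matching_ids is None:
        return False, []
    if rng is None:
        rng = random.Random()
    # Drop accidental duplicates while keeping the original order.
    unique_ids = list(dict.fromkeys(matching_ids))
    if len(unique_ids) <= MAX_SAMPLE:
        return True, unique_ids
    # random.sample draws without replacement, so no id repeats.
    return True, rng.sample(unique_ids, MAX_SAMPLE)
```

Passing a seeded random.Random instance makes the sample reproducible, which is convenient when testing the sketch.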

Action 3: Look up the evaluation table to check prior evaluation

Purpose: Avoid duplicate evaluations.
Logic: For each conversation sys_id, check sn_na_conv_eval_evaluation for an existing record showing that the conversation has already been evaluated or that an evaluation is in progress (implementation choice: state not in canceled/failed).
If not previously evaluated: Proceed to Action 4 for that conversation.
If already evaluated: Skip this conversation and optionally log "Already evaluated".
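
As a rough illustration of the duplicate check, the sketch below filters a list of conversation sys_ids against in-memory evaluation rows. The field names conversation and state and the state values are assumptions that follow the "implementation choice" noted above; a real flow would query sn_na_conv_eval_evaluation instead.

```python
# States that do NOT count as a prior evaluation (assumed values).
EXCLUDED_STATES = {"canceled", "failed"}


def is_already_evaluated(conversation_id, evaluations):
    """True if any evaluation row for this conversation is neither
    canceled nor failed, that is, it is complete or still in progress."""
    return any(
        ev["conversation"] == conversation_id
        and ev["state"] not in EXCLUDED_STATES
        for ev in evaluations
    )


def filter_unevaluated(conversation_ids, evaluations):
    """Keep only the conversations that still need an evaluation."""
    return [
        cid for cid in conversation_ids
        if not is_already_evaluated(cid, evaluations)
    ]
```

With this rule, a conversation whose only prior evaluation failed or was canceled remains eligible for a new evaluation.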

Action 4: Look up the interaction record

Purpose: Enforce HR scope exclusion.
Logic: Resolve the interaction related to the conversation. If its application scope contains hr, skip the conversation.
If the scope doesn’t contain hr: Proceed to Action 5.
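
The scope check can be pictured as a simple substring test. The sketch below is a hypothetical Python helper, not the flow's actual logic; treating the match as case-insensitive is an assumption, since the description only says that the scope "contains hr".

```python
from typing import Optional


def is_hr_scoped(application_scope: Optional[str]) -> bool:
    """Return True when the scope name contains "hr".

    Case-insensitive matching is an assumption made for this sketch.
    """
    return "hr" in (application_scope or "").lower()
```

A plain substring test also matches any unrelated scope name that happens to contain the letters "hr", so it errs on the side of excluding conversations.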

Action 5: Chat Classifier Eval

Purpose: Validate whether the conversation should be evaluated and extract high-level labels.
Logic:
- Build a lightweight transcript from sys_cs_message for classification input.
- Invoke the Chat Topic Classifier skill with the transcript.
- Receive:
  - Execute evaluation: true/false
  - Topic Name
  - Category: IT or HR
If Execute evaluation = true: Proceed to Action 6.
If false: Skip the conversation and log the classifier decision.

Action 6: buildTranscript

Purpose: Construct the final, minute-level transcript and determine the downstream skill set and guardrails.
Steps:
- Aggregate all conversation messages.
- Tag user messages as [User]: and virtual agent messages as [Virtual Agent]:.
- Knowledge articles:
  - If Genius Results reference Knowledge articles, query the Knowledge article and replace the Genius Result snippet with the entire article body.
  - Annotate with [Virtual Agent]: Help articles for user query: and wrap the content within Article_Start and Article_End.
  - Constraints:
    - If the KB is HR-scoped or inaccessible, don't evaluate (skip the conversation).
    - Truncate the article body to a maximum of 10,000 words.
    - If the KB content source is an attached file (PDF, Word, or TXT), fall back to the Genius Result instead of the full file content.
- Catalog items:
  - If Genius Results reference catalog items, query sc_cat_item and build a string from the catalog name, short description, and description.
  - Annotate with [Virtual Agent]: Please choose one of the below options: and include the citation order.
- Live agent exclusions:
  - If the first user message requests a live agent, skip evaluation.
  - If a live agent is invoked within the first 120 words, skip evaluation.
Outputs:
- ExecuteEvaluation: true/false (post-guardrail outcome)
- Chat transcript
- Knowledge articles referred
- Catalog items referred
- First live agent occurrence: sys_id of the conversation message (if present)
- Skills to invoke:
  - Coherence Chat Evaluation
  - Conciseness Chat Eval
  - Context Retention
  - Inadequate Slot Filling Chat Eval
  - Intent Accuracy Chat Eval
  - Smooth Flowing Conversation Chat Eval
  - Truthfulness Hallucination Chat Eval
- Additional logs
If ExecuteEvaluation = true: Proceed to Action 7; otherwise, skip the conversation.

Action 7: If Block

Purpose: Branch to record creation.
Logic: If ExecuteEvaluation from Action 6 is true, go to Action 8.
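
The tagging, truncation, and live agent guardrails of the buildTranscript step can be sketched as follows. This is illustrative Python under several assumptions, not the product's code: messages are plain dictionaries; the keys sender, text, article_body, requests_live_agent, and invokes_live_agent are hypothetical; the "first 120 words" rule is read as the word count of message text preceding the live agent message; and the exact layout around Article_Start and Article_End is a guess.

```python
MAX_ARTICLE_WORDS = 10_000    # truncation limit from the description
EARLY_LIVE_AGENT_WORDS = 120  # early live agent threshold


def truncate_words(text, limit=MAX_ARTICLE_WORDS):
    """Keep at most `limit` whitespace-separated words."""
    return " ".join(text.split()[:limit])


def format_article(body):
    """Annotate and wrap a Knowledge article body (layout assumed)."""
    return (
        "[Virtual Agent]: Help articles for user query: "
        f"Article_Start {truncate_words(body)} Article_End"
    )


def build_transcript(messages):
    """Return (execute_evaluation, transcript, first_live_agent_sys_id)."""
    first_user = next((m for m in messages if m["sender"] == "user"), None)
    # Guardrail 1: the first user message asks for a live agent.
    if first_user and first_user.get("requests_live_agent"):
        return False, "", None

    lines, words_so_far, first_live_agent = [], 0, None
    for m in messages:
        if m.get("invokes_live_agent") and first_live_agent is None:
            # Guardrail 2: live agent invoked within the first 120 words.
            if words_so_far < EARLY_LIVE_AGENT_WORDS:
                return False, "", m["sys_id"]
            first_live_agent = m["sys_id"]
        if m.get("article_body"):
            lines.append(format_article(m["article_body"]))
        else:
            tag = "[User]:" if m["sender"] == "user" else "[Virtual Agent]:"
            lines.append(f"{tag} {m['text']}")
        words_so_far += len((m.get("text") or "").split())
    return True, "\n".join(lines), first_live_agent
```

Catalog item formatting and the HR-scoped or inaccessible KB check are left out to keep the sketch short; they would add further branches inside the loop.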

Action 8: Create or Update evaluation record

Purpose: Persist an evaluation entry for this conversation.
Table: sn_na_conv_eval_evaluation
Field population:
- Document conversation: Conversation reference
- State: processing
- Topic: from Action 5
- Category: from Action 5
- KB Referred: from Action 6
- Catalog Referred: from Action 6
- First live agent occurrence: from Action 6
- Type: chat summarization
- User: initiating user for the conversation
- Message log: Additional logs from Action 6
On success: Proceed to Action 9.

Action 9: For Loop over skills

Purpose: Execute each selected evaluation skill.
For each skill in the list from Action 6:
- Action 10: invokeApiDefinition
  - Inputs: Skill Name, Conversation, Transcript, Evaluation Id
  - Behavior:
    - Invoke the Now Assist skill asynchronously.
    - The post-processor writes results into sys_generative_ai_response_validator.
    - Extract JSON response fields:
      - Score
      - Reason for Score
      - Examples supporting the reasoning
    - Create child metric records in sn_na_conv_eval_evaluation_metrics linked to the parent evaluation.
- Action 11: Wait
  Pause seven seconds before proceeding to the next skill to manage rate limits or throttling.
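
The loop over skills, including the pause between invocations, can be sketched in Python as shown below. This is a simplified, synchronous illustration, whereas the flow invokes skills asynchronously and a post-processor stores the response. The invoke callable, the function name run_skills, and the JSON keys score, reason, and examples are all assumptions; skipping the pause after the last skill is a choice made for this sketch.

```python
import json
import time

WAIT_SECONDS = 7  # pause between skills, as stated in the flow


def run_skills(skills, invoke, evaluation_id, sleep=time.sleep):
    """Invoke each skill, parse its JSON reply, and collect metric rows.

    invoke: hypothetical callable that takes a skill name and returns
    a JSON string with "score", "reason", and "examples" keys.
    """
    metrics = []
    for index, skill in enumerate(skills):
        data = json.loads(invoke(skill))
        # One child metric row per skill, linked to the parent evaluation.
        metrics.append({
            "evaluation": evaluation_id,
            "skill": skill,
            "score": data.get("score"),
            "reason": data.get("reason"),
            "examples": data.get("examples"),
        })
        # Wait before the next skill to stay under rate limits.
        if index < len(skills) - 1:
            sleep(WAIT_SECONDS)
    return metrics
```

Injecting the sleep function makes the pause easy to replace in tests, so the loop can be exercised without actually waiting seven seconds per skill.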