Evaluation flow

  • Release version: Yokohama
  • Updated September 2, 2025
  • 3 minutes to read

    This workflow executes evaluations when conversations are completed.

    Conversations are evaluated using the following logic:
    1. Conversation capture:

      All end-user interactions with the virtual agent are logged in the Conversation table [sys_cs_conversation]. When a user ends the conversation, the record's state is updated to Complete.

    2. Automated flow evaluation trigger:

      Flow name: Execute Evaluation.

      Trigger condition:
      • Table: Conversation table [sys_cs_conversation]
      • State: Complete
      • Device type: Web Client, Slack, Teams, Bot to Bot, Messenger

    Sequence of execution:

    Action 0: Check evaluations count for today
    • Queries the Evaluation table to get the record count for today.
    • If the record count is less than Max Number of evaluations per day, continues to Action 1; otherwise, the flow ends.
    Action 1: evalExecuteCondition
    • Invokes the evalExecuteCondition.executeEvaluation script include with the conversation reference.
    • Generates a random number (1–100). Proceeds only if ≤10 (10% random sampling).
    • Outcome: Returns true or false for further processing.
    Action 2: Conditional Branch
    • If true: Proceed to the next action.
    • If false: Evaluation stops.
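Actions 0 through 2 amount to a daily cap check followed by a 10% random sample. The sketch below illustrates that gating logic in Python under stated assumptions: the function names are hypothetical, the record count and daily maximum are passed in as plain integers, and the real logic lives in the flow and the script include rather than in this code.

```python
import random

SAMPLE_THRESHOLD = 10  # a roll of 1 to 10 out of 1 to 100 proceeds (10% sampling)


def under_daily_cap(evaluations_today: int, max_per_day: int) -> bool:
    """Action 0: continue only while today's count is below the daily maximum."""
    return evaluations_today < max_per_day


def is_sampled(rng: random.Random) -> bool:
    """Action 1: draw an integer from 1 to 100 (inclusive) and keep rolls <= 10."""
    roll = rng.randint(1, 100)
    return roll <= SAMPLE_THRESHOLD


def should_continue(evaluations_today: int, max_per_day: int, rng: random.Random) -> bool:
    """Actions 0 to 2 combined: the cap is checked first, then the random sample."""
    if not under_daily_cap(evaluations_today, max_per_day):
        return False
    return is_sampled(rng)
```

Passing the random number generator in as a parameter is a design choice made for this sketch so that the sampling can be seeded or replaced in tests.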

    Action 3: Lookup Interaction Table:

    Matches the conversation's channel metadata with the interaction table to fetch related records.

    Action 4: Application Scope Filter:

    If the interaction's application scope doesn’t include hr, continue.

    Action 5: buildTranscript:

    Detailed Transcript Construction:
    • Tags: [User]: For user messages, [Virtual Agent]: For virtual agent messages.
    • For any referenced Knowledge article:
      • Pulls the complete article body to replace the genius result, tagged with [Virtual Agent]: Help articles for user query: and delimited by Article_Start/Article_End.
      • If the Knowledge article is in HR scope/inaccessible, skip evaluation.
      • If the Knowledge article content is >10,000 words: Truncate at 10,000.
      • Attached files (PDF/Word/Txt): Use genius result instead.
    • For referenced Catalog Items:

      Extracts name, short description, description, annotated as [Virtual Agent]: Please choose one of the below options: with citation number.

    • If the first message is to the live agent, or the live agent is invoked within the first 120 words: skip evaluation.
    Outputs:
    • ExecuteEvaluation (true/false)
    • Chat Transcript
    • Knowledge articles or catalog items referred
    • Sys_id of first live agent invocation (if any)
    • List of skills to invoke (all evaluation skills for Evaluation dashboard)
    • Additional evaluation logs
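The transcript rules in Action 5 can be sketched as follows. This is an illustrative Python sketch, not the buildTranscript implementation: the `Message` class, the sender labels, and the function names are hypothetical; counting words by whitespace splitting is an assumption; and the exact line layout around Article_Start and Article_End is assumed. In this sketch, a live agent handoff that occurs after the 120-word window is simply left out of the transcript.

```python
from dataclasses import dataclass

MAX_ARTICLE_WORDS = 10_000      # Knowledge article truncation limit
LIVE_AGENT_WORD_WINDOW = 120    # handoff inside this window skips evaluation


@dataclass
class Message:
    sender: str  # hypothetical labels: "user", "virtual_agent", "live_agent_handoff"
    text: str


def truncate_words(text: str, limit: int = MAX_ARTICLE_WORDS) -> str:
    """Keep at most `limit` whitespace-separated words."""
    words = text.split()
    if len(words) <= limit:
        return text
    return " ".join(words[:limit])


def format_article(body: str) -> str:
    """Wrap a (possibly truncated) article body in the documented tag and delimiters."""
    return (
        "[Virtual Agent]: Help articles for user query:\n"
        "Article_Start\n" + truncate_words(body) + "\nArticle_End"
    )


def build_transcript(messages: list[Message]) -> tuple[bool, str]:
    """Return (ExecuteEvaluation, transcript) for a list of messages."""
    tags = {"user": "[User]", "virtual_agent": "[Virtual Agent]"}
    lines = []
    words_seen = 0
    for message in messages:
        if message.sender == "live_agent_handoff":
            # First message to the live agent, or handoff within the first 120 words.
            if words_seen < LIVE_AGENT_WORD_WINDOW:
                return False, ""
            continue  # assumption: later handoffs do not add a transcript line
        lines.append(f"{tags[message.sender]}: {message.text}")
        words_seen += len(message.text.split())
    return True, "\n".join(lines)
```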

    Action 6: Conditional Branch:

    If ExecuteEvaluation is true: Continue to Action 7.

    Action 7: Chat Classifier Eval
    • Builds the initial transcript from sys_cs_message.
    • Uses Chat topic classifier to determine:
      • Should the conversation be evaluated? (ExecuteEvaluation: true/false)
      • Topic Name
      • Category (IT/HR)
    • If ExecuteEvaluation is true: Proceed to Action 8.

    Action 8: Create or Update Evaluation Record:

    Creates a record in the Evaluation [sn_na_conv_eval_evaluation] table with:
    • Document Conversation: Conversation reference
    • State: Processing
    • Topic, Category, Knowledge article or catalog references, first live agent sys_id, type, user who initiated, message log

    Action 9: For each skill:

    Repeats for each skill in the list of skills to invoke that is output by Action 5.

    Action 10: invokeApiDefinition
    • Inputs: Skill name, conversation, transcript, evaluation id
    • Calls Now Assist Skill API asynchronously.
    • Post-processing, available in sys_generative_ai_response_validator, parses the following from the response:
      • Score
      • Reason for Score
      • Examples for the reasoning
    • Parsed data is stored in the Evaluation Metrics [sn_na_conv_eval_evaluation_metrics] table (Score, Reasons, Examples, and the entire reasoning for scoring [Scratchpad]).

    Action 11: Waits 7 seconds before continuing to the next skill.
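Actions 9 through 11 form a loop: invoke each skill, parse the result, then wait before the next skill. The Python sketch below shows that shape only. The `invoke` callable is a hypothetical placeholder for the Now Assist Skill API call (which the flow makes asynchronously; here it is a plain synchronous function), and the response keys `score`, `reason`, `examples`, and `scratchpad` are assumed names, not the documented response format.

```python
import time
from typing import Callable, Iterable

WAIT_SECONDS = 7  # pause between skills, per Action 11


def parse_skill_response(response: dict) -> dict:
    """Keep the fields that are stored on the Evaluation Metrics table."""
    return {
        "score": response.get("score"),
        "reasons": response.get("reason"),
        "examples": response.get("examples", []),
        "scratchpad": response.get("scratchpad"),
    }


def run_skills(
    skills: Iterable[str],
    invoke: Callable[[str], dict],
    sleep: Callable[[float], None] = time.sleep,
    wait_seconds: float = WAIT_SECONDS,
) -> list[dict]:
    """Invoke each skill in turn, parse its response, and wait before the next one."""
    results = []
    for skill in skills:
        parsed = parse_skill_response(invoke(skill))
        results.append({"skill": skill, **parsed})
        sleep(wait_seconds)
    return results
```

The `sleep` parameter is injectable so that the loop can be exercised without actually waiting 7 seconds per skill.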

    Special behavior and edge case handling:
    • Sampling: Only 10% of conversations (randomly chosen) are evaluated.
    • Channel Filter: Only Web Client, Slack, Teams, Bot to Bot, and Messenger.
    • Application Scope: Excludes records with _hr_ in the scope.
    • Knowledge article controls: No evaluation for HR or inaccessible Knowledge articles, limits on Knowledge article size, and file handling.
    • First live agent invocation: Excludes conversations routed to the live agent at the start or within the first 120 words.
    • The Request Completion skill is added as part of a business rule, where its score is set to the lower of the Slot filling and Intent scores.
    • The reason on the record is added as follows:
      if (Slot filling score > Intent score) {
        Intent reason is used
      } else if (Slot filling score < Intent score) {
        Slot filling reason is used
      } else {
        Both are used
      }
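The Request Completion rule above can be written as a small function. This is a minimal Python sketch with a hypothetical function name; it assumes numeric scores, and the way the two reasons are joined when the scores are equal is an assumption, since this topic only states that both are used.

```python
def combine_request_completion(
    slot_score: float,
    slot_reason: str,
    intent_score: float,
    intent_reason: str,
) -> tuple[float, str]:
    """Return the Request Completion score and reason from the two inputs."""
    # The score is the lower of the Slot filling and Intent scores.
    score = min(slot_score, intent_score)
    if slot_score > intent_score:
        reason = intent_reason       # Intent scored lower, so its reason is used
    elif slot_score < intent_score:
        reason = slot_reason         # Slot filling scored lower, so its reason is used
    else:
        # Equal scores: both reasons are used (joining with a space is an assumption).
        reason = f"{slot_reason} {intent_reason}"
    return score, reason
```

In each branch the reason that is kept belongs to whichever skill produced the lower score, which keeps the stored reason consistent with the stored score.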