Evaluation flow for batch evaluations

  • Release version: Australia
  • Updated: March 12, 2026
  • Reading time: 11 minutes

    Batch evaluation enables Eval admins to evaluate up to 100 completed virtual agent conversations at once, based on a saved query.

    Flow name: Execute Batch Evaluation.

    The flow creates evaluation records and invokes Now Assist skills for each eligible conversation, mirroring the single-conversation evaluation logic, but at scale. It enforces HR scope exclusions, topic/category validation, transcript construction rules, early live-agent exclusions, and asynchronous scoring through skills.

    Batch evaluations are performed using the following logic:

    Trigger
    • Table: Evaluation set [sn_na_conv_eval_evaluation_set]
    • Condition: State changes to In Progress and Evaluation type = Conversation
    Inputs
    • Evaluation Set record with:
      • Query filter: A query that targets conversations to be evaluated (for example, sys_cs_conversation filters).
      • Evaluation type: Conversation
      • State: In Progress (to start)
    • LLM/Skills: Chat Topic Classifier, plus the evaluation skills listed under Skills to invoke in the buildTranscript outputs.
    High-level behavior
    • Reads the query filter and randomly samples up to 100 conversations.
    • Skips already-evaluated conversations.
    • Excludes HR-scoped interactions.
    • Uses Chat Topic Classifier to validate evaluation eligibility and extracts Topic and Category.
    • Builds a transcript with controlled inclusion of Knowledge articles and catalog sources, and applies early live agent exclusions.
    • Creates an Evaluation record and asynchronously invokes all selected evaluation skills, writing scores and rationale to metrics.

    Sequence of execution:

    Action 1: If the query filter isn’t empty
    • Purpose: Guard clause.
    • Logic: Look up the Evaluation Set record and check the query filter field.
    • If the query filter is present: Proceed to Action 2.
    • If empty: Stop and optionally log "No query provided".
    Action 2: Randomize conversations
    • Purpose: Select a bounded, random sample of conversations from the provided query.
    • Logic:
      • Execute the query to get matching conversation records.
      • Randomly select up to 100 conversations.
        • If more than 100 conversations match, cap the sample at 100.
        • If 100 or fewer match, select all of them.
      • Validate the query; if it is invalid, return success = false and an empty or partial conversation_ids array.
    • Outputs:
      • success: true/false
      • conversation_ids: array of sys_ids (max 100)
    • If success = true: Proceed to Action 3; otherwise, stop and log the validation error.
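The sampling rule in Action 2 can be sketched as follows. This is a minimal illustration, not the flow's actual implementation: `run_query` is a hypothetical callable standing in for the saved-query lookup, and it is assumed to raise `ValueError` when the query is invalid. The sketch returns an empty array on failure, although the flow may also return a partial one.

```python
import random
from typing import Callable, Dict, Optional, Sequence

MAX_BATCH_SIZE = 100  # cap stated in the flow description


def randomize_conversations(
    query_filter: str,
    run_query: Callable[[str], Sequence[str]],
    rng: Optional[random.Random] = None,
) -> Dict[str, object]:
    """Return up to MAX_BATCH_SIZE randomly chosen conversation sys_ids.

    run_query is a hypothetical stand-in that executes the saved query and
    returns matching sys_ids; it is assumed to raise ValueError if invalid.
    """
    rng = rng or random.Random()
    try:
        matches = list(run_query(query_filter))
    except ValueError:
        # Invalid query: report failure with an empty result.
        return {"success": False, "conversation_ids": []}
    # Sample without replacement; take everything when 100 or fewer match.
    sample_size = min(len(matches), MAX_BATCH_SIZE)
    return {"success": True, "conversation_ids": rng.sample(matches, sample_size)}
```

Passing a seeded `random.Random` instance makes the sample reproducible, which can help when testing the downstream actions.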
    Action 3: Look up the evaluation table to check prior evaluation
    • Purpose: Avoid duplicate evaluations.
    • Logic: For each conversation sys_id, check sn_na_conv_eval_evaluation for existing records indicating that it's already evaluated or is in progress (implementation choice: state not in canceled/failed).
    • If not previously evaluated: Proceed to Action 4 for that conversation.
    • If already evaluated: Skip this conversation and optionally log "Already evaluated".
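The duplicate check in Action 3 amounts to a set lookup. The sketch below assumes each existing evaluation is available as a mapping with hypothetical `conversation` and `state` keys, and that the state labels match the "state not in canceled/failed" choice mentioned above.

```python
from typing import Iterable, List, Mapping

# Assumption: these labels mirror "state not in canceled/failed" from the text.
IGNORED_STATES = {"canceled", "failed"}


def conversations_needing_evaluation(
    conversation_ids: Iterable[str],
    existing_evaluations: Iterable[Mapping[str, str]],
) -> List[str]:
    """Keep only conversations with no completed or in-progress evaluation."""
    already_covered = {
        record["conversation"]
        for record in existing_evaluations
        if record["state"] not in IGNORED_STATES
    }
    # Preserve the input order of the sampled conversations.
    return [cid for cid in conversation_ids if cid not in already_covered]
```

With this rule, a conversation whose only prior evaluation failed or was canceled is picked up again.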
    Action 4: Look up the interaction record
    • Purpose: Enforce HR scope exclusion.
    • Logic: Resolve the interaction related to the conversation. If its application scope contains hr, skip the conversation.
    • If the scope doesn’t contain hr: Proceed to Action 5.
    Action 5: Chat Classifier Eval
    • Purpose: Validate whether the conversation should be evaluated and extract high-level labels.
    • Logic:
      • Build a lightweight transcript from sys_cs_message for classification input.
      • Invoke the Chat Topic Classifier skill with the transcript.
      • Receive:
        • Execute evaluation: true/false
        • Topic Name
        • Category: IT or HR
    • If Execute evaluation = true: Proceed to Action 6.
    • If false: Skip the conversation and log the classifier decision.
    Action 6: buildTranscript
    • Purpose: Construct the final, minute-level transcript and determine downstream skill set and guardrails.
    • Steps:
      • Aggregate all conversation messages.
      • Tag user messages as [User]: and virtual agent messages as [Virtual Agent]:.
      • Knowledge articles:
        • If genius results reference Knowledge articles, query the Knowledge article and replace the genius snippet with the entire article body.
        • Annotate with [Virtual Agent]: Help articles for user query: and wrap content within Article_Start and Article_End.
        • Constraints:
          • If the KB is HR-scoped or inaccessible, don't evaluate (skip conversation).
          • Truncate the article body to a maximum of 10,000 words.
          • If the KB content source is attached files (PDF/Word/Txt), fall back to the genius result instead of full file content.
      • Catalog Items:
        • If genius results reference catalog items, query sc_cat_item and build a string: catalog name, short description, description.
        • Annotate with [Virtual Agent]: Please choose one of the below options: and include citation order.
      • Live Agent Exclusions:
        • If the first user message requests a live agent, skip evaluation.
        • If a live agent is invoked within the first 120 words, skip evaluation.
    • Outputs:
      • ExecuteEvaluation: true/false (post-guardrail outcome)
      • Chat transcript
      • Knowledge articles referred
      • Catalog items referred
      • First live agent occurrence: sys_id of the conversation message (if present)
      • Skills to invoke:
        • Coherence Chat Evaluation
        • Conciseness Chat Eval
        • Context Retention
        • Inadequate Slot Filling Chat Eval
        • Intent Accuracy Chat Eval
        • Smooth Flowing Conversation Chat Eval
        • Truthfulness Hallucination Chat Eval
      • Additional logs
    • If ExecuteEvaluation = true: Proceed to Action 7; otherwise, skip the conversation.
    Action 7: If Block
    • Purpose: Branch to record creation.
    • Logic: If ExecuteEvaluation from Action 6 is true, go to Action 8.
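The buildTranscript rules (speaker tags, article wrapping, the 10,000-word cap, and the early live-agent exclusions) can be sketched as below. Several details are assumptions made for illustration: the `Message` shape and its field names are hypothetical, the exact layout of the `Article_Start`/`Article_End` wrapper is a guess, and "within the first 120 words" is read here as fewer than 120 words of message text before the handoff. Catalog items and the HR or attachment constraints are left out for brevity.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

MAX_ARTICLE_WORDS = 10_000    # article truncation limit from the text
LIVE_AGENT_WORD_WINDOW = 120  # early live-agent window from the text


@dataclass
class Message:
    """Hypothetical, simplified view of one conversation message."""
    sender: str                         # "user" or "virtual_agent"
    text: str
    article_body: Optional[str] = None  # full Knowledge article body, if cited
    live_agent_handoff: bool = False    # True if this message invokes a live agent


def truncate_words(text: str, limit: int = MAX_ARTICLE_WORDS) -> str:
    """Keep at most `limit` whitespace-separated words."""
    return " ".join(text.split()[:limit])


def build_transcript(messages: List[Message]) -> Dict[str, object]:
    lines: List[str] = []
    words_so_far = 0
    seen_user_message = False
    for msg in messages:
        is_first_user_message = msg.sender == "user" and not seen_user_message
        # Guardrail: skip when the first user message asks for a live agent,
        # or when the handoff happens inside the early word window.
        if msg.live_agent_handoff and (
            is_first_user_message or words_so_far < LIVE_AGENT_WORD_WINDOW
        ):
            return {"execute_evaluation": False, "transcript": ""}
        if msg.sender == "user":
            seen_user_message = True
            lines.append(f"[User]: {msg.text}")
        elif msg.article_body is not None:
            # Replace the snippet with the (truncated) full article body.
            body = truncate_words(msg.article_body)
            lines.append(
                "[Virtual Agent]: Help articles for user query: "
                f"Article_Start {body} Article_End"
            )
        else:
            lines.append(f"[Virtual Agent]: {msg.text}")
        words_so_far += len(msg.text.split())
    return {"execute_evaluation": True, "transcript": "\n".join(lines)}
```

Returning the guardrail outcome alongside the transcript mirrors the ExecuteEvaluation output, so the caller can branch on a single result.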
    Action 8: Create or Update evaluation record
    • Purpose: Persist an evaluation entry for this conversation.
    • Table: sn_na_conv_eval_evaluation
    • Field population:
      • Document conversation: Conversation reference
      • State: processing
      • Topic: from Action 5
      • Category: from Action 5
      • KB Referred: from Action 6
      • Catalog Referred: from Action 6
      • First live agent occurrence: from Action 6
      • Type: chat summarization
      • User: initiating user for the conversation
      • Message log: Additional logs from Action 6
    • On success: Proceed to Action 9.
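The field population for the evaluation record can be pictured as a simple mapping from the classifier and transcript outputs. The dictionary keys below are hypothetical labels chosen for readability, not the real column names of sn_na_conv_eval_evaluation; the literal values for State and Type are taken from the list above.

```python
from typing import Any, Dict


def build_evaluation_record(
    conversation_id: str,
    user_id: str,
    classifier_result: Dict[str, Any],
    transcript_result: Dict[str, Any],
) -> Dict[str, Any]:
    """Assemble the field values listed for the evaluation record.

    All key names are illustrative; the real table columns may differ.
    """
    return {
        "document_conversation": conversation_id,
        "state": "processing",
        # Topic and Category come from the Chat Topic Classifier output.
        "topic": classifier_result["topic_name"],
        "category": classifier_result["category"],
        # The remaining references come from the buildTranscript outputs.
        "kb_referred": transcript_result["knowledge_articles"],
        "catalog_referred": transcript_result["catalog_items"],
        "first_live_agent_occurrence": transcript_result.get(
            "first_live_agent_occurrence"
        ),
        "type": "chat summarization",
        "user": user_id,
        "message_log": transcript_result.get("additional_logs", ""),
    }
```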
    Action 9: For Loop over skills
    • Purpose: Execute each selected evaluation skill.
    • For each skill in the list from Action 6:
      • Action 10: invokeApiDefinition
        • Inputs: Skill Name, Conversation, Transcript, Evaluation Id
        • Behavior:
          • Invoke the Now Assist skill asynchronously.
          • The post-processor writes results into sys_generative_ai_response_validator.
          • Extract JSON response fields:
            • Score
            • Reason for Score
            • Examples supporting the reasoning
          • Create child metric records in sn_na_conv_eval_evaluation_metrics linked to the parent evaluation.
      • Action 11: Wait
        • Behavior: Pause seven seconds before proceeding to the next skill to manage rate limits or throttling.
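The skill loop in Actions 9 through 11 can be sketched as follows. This is a simplified, synchronous illustration: the flow invokes skills asynchronously and a post-processor writes the results, whereas here a hypothetical `invoke_skill` callable returns the JSON response directly. The JSON keys (`score`, `reason`, `examples`) and the metric field names are assumptions.

```python
import json
import time
from typing import Callable, Dict, List

SKILL_PAUSE_SECONDS = 7  # pause between skills stated in the flow


def run_skills(
    skills: List[str],
    conversation_id: str,
    transcript: str,
    evaluation_id: str,
    invoke_skill: Callable[[str, str, str, str], str],
    sleep: Callable[[float], None] = time.sleep,
) -> List[Dict[str, object]]:
    """Invoke each skill, collect one metric per skill, and pause in between.

    invoke_skill is a hypothetical stand-in that returns the skill's JSON
    response as a string.
    """
    metrics: List[Dict[str, object]] = []
    for skill in skills:
        raw = invoke_skill(skill, conversation_id, transcript, evaluation_id)
        payload = json.loads(raw)
        # One child metric per skill, linked to the parent evaluation.
        metrics.append({
            "evaluation": evaluation_id,
            "skill": skill,
            "score": payload["score"],
            "reason": payload["reason"],
            "examples": payload.get("examples", []),
        })
        # Throttle before the next skill to stay within rate limits.
        sleep(SKILL_PAUSE_SECONDS)
    return metrics
```

Injecting `sleep` as a parameter keeps the seven-second pause out of tests while leaving the default behavior unchanged.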