Evaluation metrics and calculations
Metrics against which conversations are evaluated and calculation of adjusted scores.
Metrics
| Metric | Description |
|---|---|
| Request Completion | Measures the virtual agent's ability to complete user requests by accurately identifying the user's intent and gathering all required information (slot filling). |
| Intent Accuracy | Shows the virtual agent's ability to comprehend user requests, resulting in relevant responses. |
| Slot Filling | Shows the virtual agent's capability to interpret user responses and extract structured answers to the required questions. |
| Smooth Flowing Conversation (Deadlock avoidance) | Checks if the virtual agent responds dynamically, successfully moving the conversation forward without repetition. |
| Context Retention | Shows if the virtual agent succeeds in retaining and using information provided during the conversation, including request interpretation and slot filling. |
| Truthfulness (Hallucination Prevention) | Shows if the virtual agent generated genuine responses grounded in conversation, excluding fabrication or memory and comprehension failures. |
| Conciseness (Redundancy Avoidance) | Checks the virtual agent's ability to avoid superfluous or verbose and generic responses, which doesn't contribute to the core intent of the conversation. |
| Coherence | Checks for clear logical flow, structure, and organization of the virtual agent's responses. |
| User Satisfaction | Weighted average of all the other metrics on which the conversation was evaluated. |
Calculations
Calculation of deviations and Adjusted Score:
- Upper Deviation
Condition: If the number of human-labeled scores that are higher than the auto-evaluated scores in the last 6 months is more than 30.
Calculation: The top 90% of these cases are taken and the difference (delta) between the human score and the auto-evaluated score is averaged. This delta is the Upper Deviation.
- Lower Deviation
Condition: If the number of human-labeled scores that are lower than the auto-evaluated scores in the last 6 months is more than 30.
Calculation: The top 90% of these cases are taken and the difference (delta) between the human score and the auto-evaluated score is averaged. This delta is the Lower Deviation.
- Adjusted ScoreThe final Adjusted Score is calculated based on the availability of the deviations.
- If at least 30 distinct evaluations of both upper and lower deviations are labeled for a given metric, Error band is calculated as SUM(Avg labeling score – LLM score)/Distinct evaluations. This error band is added to Auto-Eval score to get Adjusted Score.
- If neither deviation is available, then Adjusted Score = Auto-Eval Score
- Auto eval user satisfaction score: For a given evaluation, get all the scores for each metric that are LLM generated and calculate SUM(metric score * metric weight)/SUM(metric weights).
- Human user satisfaction score: For a given evaluation, if at least one metric is labeled, it’s considered to calculate the human user satisfaction score. If labeled, the labeling score is used, or else LLM score is used. Calculated as SUM(metric score * metric weight)/SUM(metric weights).
- Gap: Gap is calculated as (Human user satisfaction score – Auto eval satisfaction score).
- Upper Deviation: If the Gap is positive and there are more than 30 records, the error band is calculated at top 90% by SUM(Positive Gap) / Distinct evaluations. This error band is added to the Auto eval user satisfaction score.
- Lower Deviation: If the Gap is negative and there are more than 30 records, the error band is calculated at top 90% by SUM(Negative Gap) / Distinct evaluations. This error band is added to the Auto eval user satisfaction score.
- Adjusted user satisfaction score is calculated as SUM(Gap) / Distinct evaluations.
- The evaluator provides aggregated score per chat, even if there are multiple different requests made by user.
- Performance Analytics indicators are used to calculate the average score over time. If you run batch jobs on historical data, then by the definition of Performance Analytics indicators, these evaluations are counted on the evaluation date in aggregated scores and not counted for scores on the actual chat date.