
Inquiry regarding Test Guidelines for Virtual Agent Knowledge Search

晃秀黒
Tera Contributor

We are currently implementing a new knowledge search function for the Virtual Agent.
In preparation for acceptance testing, we are developing test guidelines. Could you please share best practices, recommendations, and examples for the following three points?

1. Recommended Test Items
Are there any essential aspects that should be covered to evaluate search accuracy and user experience? We are currently considering the following items:

Keyword Matching: Search accuracy using exact wording.

Natural Language Understanding: Search accuracy using sentences or spoken language.

Zero-hit Handling: System behavior when no matching articles are found.

2. Test Data Volume
To statistically evaluate search accuracy, what is the typical number of data points (knowledge articles and test queries) required?

3. Evaluation Metrics
Please advise whether there are any metrics generally used as a "passing grade" for evaluating response accuracy.

1 ACCEPTED SOLUTION

vaishali231
Tera Guru

Hey @晃秀黒,

Recommended Test Items

To properly evaluate both search accuracy and user experience, testing should cover the following areas:

Keyword Matching
Verify search accuracy using:

Exact keywords

Partial keywords

Synonyms

Abbreviations

Common misspellings

The correct article should ideally appear within the top 1 to 3 results.
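The keyword-variant checks above can be expressed as a small parameterized test harness. This is only a minimal sketch: `search()` here is a toy word-overlap matcher over an in-memory article list, standing in for a call to your actual Virtual Agent knowledge search, and the article IDs and titles are invented for illustration.

```python
# Toy in-memory knowledge base; replace with your real article set.
ARTICLES = {
    "KB001": "How to reset your VPN password",
    "KB002": "Configuring email on a mobile phone",
}

def search(query):
    """Toy ranked search: rank articles by word overlap with the query.

    Stand-in for the real knowledge search call under test.
    """
    q_words = set(query.lower().split())
    scored = [
        (len(q_words & set(title.lower().split())), kb_id)
        for kb_id, title in ARTICLES.items()
    ]
    return [kb_id for score, kb_id in sorted(scored, reverse=True) if score > 0]

# Variant test cases: (query, expected article, variant type)
CASES = [
    ("reset VPN password", "KB001", "exact keywords"),
    ("VPN password",       "KB001", "partial keywords"),
]

for query, expected, variant in CASES:
    top3 = search(query)[:3]  # "correct article within top 3" criterion
    status = "PASS" if expected in top3 else "FAIL"
    print(f"{status} [{variant}] {query!r} -> {top3}")
```

Extending `CASES` with synonym, abbreviation, and misspelling variants gives one row per test item in the guideline.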

Natural Language Understanding
Test full conversational queries such as:

“I forgot my VPN password, how do I reset it?”

“My email is not working on my phone.”

Evaluate:

Relevance of returned results

Logical ranking order

Avoidance of unrelated articles

Zero-hit Handling
Validate system behavior when no article matches:

User-friendly fallback message

Suggested alternative keywords

Option to contact support or create a case

Ranking Quality
Even when the correct article appears, check:

Is it ranked first?

Are outdated articles ranked above updated ones?

Are low-quality articles ranked too high?

Context Awareness
Test behavior based on:

User roles

Department-based visibility

Language filtering

Authenticated vs non-authenticated users

Conversational User Experience
Assess:

Number of steps to resolution

Article preview clarity

Formatting inside chat

Need for escalation

 

Recommended Test Data Volume

There is no strict universal standard, but the following is commonly recommended:

Knowledge Base Size:

Minimum 200 to 500 articles for meaningful testing

Preferably 1,000+ for mature environments

Test Queries:

Minimum 50 queries

Recommended 100 to 300 queries for statistical reliability

Suggested distribution for 100 queries:

20 exact keyword searches

20 natural language queries

20 ambiguous queries

15 typo or misspelled queries

15 zero-result scenarios

10 complex or edge-case queries

Using at least 100 test queries provides a more stable evaluation baseline.
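The suggested 100-query split can be kept as a small plan that validates itself, so the category counts always sum to the target volume. A minimal sketch (the category labels simply mirror the buckets above):

```python
# Planned distribution for a 100-query acceptance test set.
DISTRIBUTION = {
    "exact keyword":       20,
    "natural language":    20,
    "ambiguous":           20,
    "typo / misspelled":   15,
    "zero-result":         15,
    "complex / edge-case": 10,
}

total = sum(DISTRIBUTION.values())
# Guard against counts drifting when the plan is edited.
assert total == 100, "category counts should sum to the target query volume"

for category, count in DISTRIBUTION.items():
    print(f"{category:<22} {count:>3}  ({count / total:.0%})")
```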

 

Evaluation Metrics and Passing Criteria

Precision at Top Results
Precision@1: Correct article appears at rank 1
Precision@3: Correct article appears within top 3
Recommended benchmarks:

Precision@1 >= 70 to 80 percent

Precision@3 >= 85 to 90 percent
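Precision@k is straightforward to compute once each test query is labeled with its expected article. A minimal sketch, where each result record pairs the expected article ID with the ranked list the search returned (the KB IDs are made up for illustration):

```python
def precision_at_k(results, k):
    """Fraction of queries whose expected article appears in the top k."""
    hits = sum(1 for expected, ranked in results if expected in ranked[:k])
    return hits / len(results)

# (expected article, ranked results returned by the search)
results = [
    ("KB001", ["KB001", "KB007"]),  # correct at rank 1
    ("KB002", ["KB009", "KB002"]),  # correct at rank 2
    ("KB003", ["KB004", "KB005"]),  # miss
    ("KB004", ["KB004"]),           # correct at rank 1
]

p1 = precision_at_k(results, 1)  # 2/4 = 0.50
p3 = precision_at_k(results, 3)  # 3/4 = 0.75
print(f"Precision@1 = {p1:.0%}, Precision@3 = {p3:.0%}")
```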

Success Rate
Percentage of queries resolved without escalation
Recommended:

Minimum 75 percent

85 to 90 percent indicates strong performance

Zero-hit Rate
Percentage of queries returning no results
Target:

Less than 5 to 10 percent

Escalation Rate
Percentage of sessions requiring live agent support
Target:

Less than 20 percent

Average Time to Resolution
Time from query submission to article access
Recommended:

Under 30 seconds on average
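The remaining session-level metrics can all be derived from one test log. A minimal sketch, assuming each session record carries a hit count, an escalation flag, and a time to resolution (the field names and sample values are hypothetical; adapt them to however your test log captures sessions):

```python
# Hypothetical session log from an acceptance test run.
sessions = [
    {"hits": 3, "escalated": False, "seconds": 12},
    {"hits": 0, "escalated": True,  "seconds": 45},
    {"hits": 1, "escalated": False, "seconds": 20},
    {"hits": 2, "escalated": False, "seconds": 28},
]

n = len(sessions)
zero_hit_rate   = sum(s["hits"] == 0 for s in sessions) / n
escalation_rate = sum(s["escalated"] for s in sessions) / n
success_rate    = 1 - escalation_rate  # resolved without escalation
avg_seconds     = sum(s["seconds"] for s in sessions) / n

print(f"Zero-hit rate:   {zero_hit_rate:.0%}  (target < 5-10%)")
print(f"Escalation rate: {escalation_rate:.0%}  (target < 20%)")
print(f"Success rate:    {success_rate:.0%}  (target >= 75%)")
print(f"Avg resolution:  {avg_seconds:.0f}s  (target < 30s)")
```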

 


*************************************************************************************************************************************

If this response helps, please mark it as Accept as Solution and Helpful.

Doing so helps others in the community and encourages me to keep contributing.

 

Regards,

Vaishali Singh

 

 


3 REPLIES

vaishali231
Tera Guru


Hey @晃秀黒,

Hope you are doing well.

Did my previous reply answer your question?

If it was helpful, please mark it as correct ✓ and close the thread. This will help other readers find the solution more easily.

Regards,
Vaishali Singh



Hi Vaishali,

Thank you for following up. Yes, your previous reply was very helpful and clearly answered my question.

I have marked it as the correct solution and will close the thread now. Thanks again for your support!

Best regards,