
Testing your Natural Language Understanding (NLU) model against a set of utterances is an integral part of ensuring your model performs optimally. The platform provides three primary mechanisms for testing your model during different stages of NLU model and Virtual Agent (VA) topic building, from within the NLU Workbench and Virtual Agent Designer.
- Test single utterances using Try Model from NLU Workbench: Try Model lets you test the model with single utterance samples from within the Workbench. In doing so, it also provides the capability to mark each result as correct or incorrect as feedback to the model. For incorrect results, an NLU admin can also select the correct intent from the available intents, or pick no intent as the expected outcome. This feedback helps further train the model based on the provided inputs.
- Test single utterances from VA Designer: When building VA topics, we can also test utterances from within VA Designer. The NLU model does need to be published for the most recent model changes to factor into the predictions within Designer. Testing from the NLU tab in Designer is similar to testing samples from within the Workbench. In addition, we can also test using the Test Active Topics button in Designer, which lets us test both the NLU model and the VA topic simultaneously.
- Batch Testing tool: The Batch Testing feature, available as part of the Advanced NLU Workbench, allows NLU admins to test the NLU model by uploading a large batch of test utterances and their expected intents to understand how the model is performing and predicting. This in turn helps tune the model based on what the test results tell us.
Refer to the sections below for expert tips and tricks on using Batch Testing to tune your NLU models for optimal performance.
- Creating a Batch Test Set to Assess Model Performance
- Two elements: utterance and expected intent (a minimal sketch of the file shape follows this list)
- Quantity: at least 50 unique test utterances (ideally 100+) per intent. They can be gathered from Open NLU, chat, or incident logs.
- Quality: representative of how end users talk. The test samples should not be exact copies of training utterances. Include 15-20% of samples that should not match any intent.
- Purpose: test fallback behavior and adjust the confidence threshold
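To make the two-element file shape concrete, here is a minimal sketch in Python that writes a small test set as a CSV with utterance and expected-intent columns, including a couple of no-intent samples for fallback testing. The file name, column headers, and sample intents are illustrative assumptions, not an official Batch Testing template; check the import dialog in your instance for the exact format it expects.

```python
import csv

# Hypothetical batch test set: each row pairs an utterance with its
# expected intent; rows with an empty intent should fall back to "no intent".
test_rows = [
    ("I can't log in to my laptop", "Password Reset"),
    ("my password expired again", "Password Reset"),
    ("order a new mouse for my desk", "Order Hardware"),
    ("asdf qwerty zxcv", ""),                  # gibberish -> expect no intent
    ("what's the cafeteria menu today", ""),   # out of scope -> expect no intent
]

# Write the set to a CSV file for import into Batch Testing.
# Column names here are assumptions based on this article's description.
with open("batch_test_set.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["Utterance", "Expected intent"])
    writer.writerows(test_rows)
```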
- Guidance for creating Test Sets from Chat Logs
- Identify source of test utterances:
- Post NLU go-live:
- Extract end-user samples from the open_nlu_predict_intent_feedback table's utterance and prediction columns for a specified time period into Excel format, and rename the Prediction column in the Excel file to 'Expected intent'
- Sort the spreadsheet by the Expected intent column and use Excel's unique function to remove any duplicate samples
- Finally, review the Expected intent value for each sample, correct any wrongly predicted intents, and use the result as the test set to import and run Batch Testing (a scripted version of this clean-up is sketched after this list)
- Pre NLU go-live:
- Where available, use interaction.short_description or interaction_log.utterance data to gather test samples
- Create a Batch Testing file with these utterances, leaving the Expected intent column empty, and run Batch Testing
- Validate the predictions on the detailed Batch Testing results page
- As you review, update the 'Expected intent' values in the Batch Testing file with your validated labels
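For the post-go-live steps above, the following Python sketch performs the same clean-up as the Excel workflow: it deduplicates utterances, renames the prediction column to 'Expected intent', and sorts by intent for review. It assumes the open_nlu_predict_intent_feedback data was exported to a CSV with lowercase utterance and prediction column headers; adjust the names to match your actual export.

```python
import csv

# Read the exported feedback data (assumed CSV export of the
# open_nlu_predict_intent_feedback table's utterance/prediction columns).
with open("nlu_feedback_export.csv", newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))

# Deduplicate on the utterance text (case-insensitive), mirroring
# Excel's unique function, and rename prediction -> Expected intent.
seen = set()
test_set = []
for row in rows:
    key = row["utterance"].strip().lower()
    if key and key not in seen:
        seen.add(key)
        test_set.append({"Utterance": row["utterance"],
                         "Expected intent": row["prediction"]})

# Sort by expected intent so related samples are grouped for review.
test_set.sort(key=lambda r: r["Expected intent"])

# Write out the draft test set; review and correct any wrong
# 'Expected intent' values before importing into Batch Testing.
with open("batch_test_set.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["Utterance", "Expected intent"])
    writer.writeheader()
    writer.writerows(test_set)
```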
- Tuning Tips from Batch Testing Results
- ServiceNow recommendations for model quality:
- >80% Correct; <10% Incorrect; <10% Missed (a sketch for computing these percentages from results appears after this list)
- Use the initially created test set as a “golden data set” for future use. Whenever you make changes to the model, you will have a reality-check test suite to ensure you don’t introduce any regression in model performance
- Reviewing Batch Testing results:
- Look for overall tuning opportunities:
- Identify patterns of errors: maybe some terminology is not represented in the model
- Identify opportunities to add vocabulary
- If there are persistent errors, double-check that the model follows our best practices
- For samples that the model missed predicting:
- If this is expected, consider whether you need to support the request at all (or is it just gibberish?) and whether a KB article or catalogue item can provide the user with the information they require
- If this is unexpected, consider adding a few representative samples to the intents where those samples belong
- For samples that the model incorrectly predicted:
- Investigate the predicted intent to see which samples might be contributing to the ambiguity
- Investigate the expected intent and make sure it too has clear and sufficient samples
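To check your results against the quality recommendations above, the sketch below tallies Correct, Incorrect, and Missed outcomes from an exported results file. The CSV layout ('Expected intent' and 'Predicted intent' columns, with an empty prediction meaning no intent matched) is an assumption for illustration; the real Batch Testing results export may use different columns.

```python
import csv
from collections import Counter

# Tally outcomes from an (assumed) CSV export of Batch Testing results.
counts = Counter()
with open("batch_test_results.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        expected = row["Expected intent"].strip()
        predicted = row["Predicted intent"].strip()
        if predicted == expected:
            counts["Correct"] += 1     # includes no-intent samples that correctly fell back
        elif not predicted:
            counts["Missed"] += 1      # model returned no intent for a real intent
        else:
            counts["Incorrect"] += 1   # model predicted the wrong intent

# Compare against the recommended thresholds:
# >80% Correct, <10% Incorrect, <10% Missed.
total = sum(counts.values())
if total:
    for label, target in [("Correct", ">80%"), ("Incorrect", "<10%"), ("Missed", "<10%")]:
        print(f"{label}: {100 * counts[label] / total:.1f}% (target {target})")
```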
Additional NLU Related Resources:
- NLU Documentation
- Now Learning – NLU Fundamentals
- Virtual Agent Academy
- Virtual Agent and NLU Quick Start Guide
- In-Depth Guide rails to building good NLU Models
- NLU FAQ, best practices, and general troubleshooting (San Diego release)
- NLU Best Practices – Using Vocabulary & Vocabulary Sources
- NLU Testing Capabilities and Techniques for your NLU Models
- Best Practices: single v. multiple NLU models
- Using NLU Model Optimize to Tune your Model
- NLU Model Optimize – FAQs
- Migrating VA and NLU between instances with update sets
- Virtual Agent and NLU Implementation
- Guided Overview to Implementing Multilingual NLU Models in NOW platform
Additional NLU troubleshooting KBs: