Following our recent research presented in the Proceedings of the 4th Workshop on NLP for Conversational AI at ACL 2022, this blog post discusses our latest work exploring how to best leverage the power of Large Language Models to generate more training data. Gaurav Sahu is the first author of this paper and contributed this blog post. Marie-Ève Marchand and Sean Hughes also contributed as reviewers and editors of this article.
@inproceedings{sahu-etal-2022-data,
    title = "Data Augmentation for Intent Classification with Off-the-shelf Large Language Models",
    author = "Sahu, Gaurav and Rodriguez, Pau and Laradji, Issam and Atighehchian, Parmida and Vazquez, David and Bahdanau, Dzmitry",
    booktitle = "Proceedings of the 4th Workshop on NLP for Conversational AI",
    month = may,
    year = "2022",
    address = "Dublin, Ireland",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.nlp4convai-1.5",
    pages = "47--57",
}
In enterprise AI, there is a rising need to make chatbots more intelligent. One of the challenges is to increase their ability to classify an utterance into an intent. In other words, how can we improve the performance of a task-oriented chatbot at identifying what the user wants to accomplish? Training such intent classifiers proves difficult because often, there’s not enough labelled data.
Using off-the-shelf large language models like GPT-3 looks very promising to generate training data even for complex classification setups like intent classification.
We only need to ask the right question!
Generating new training data with GPT is super-easy: show it your examples, and it will do the rest! All of this without the headache of fine-tuning custom large language model generators.
Figure 1: Prompt template fed to GPT. Note that the prompt is incomplete, and GPT generates good (blue) and bad (red) examples.
The art of crafting a good prompt is somewhat mystical, and in all fairness, we didn’t land on ours on the first try. However, this lore deserves an article of its own. For now, note that the more organized the prompt, the better the generations.
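To make the "organized prompt" idea concrete, here is a minimal sketch of how such a few-shot generation prompt might be assembled. The intent name, seed utterances, and exact prompt wording below are our invented stand-ins for illustration, not the template used in the paper:

```python
# Sketch: assemble a few-shot data-augmentation prompt for one intent.
# The model is shown numbered seed examples and left an open slot so
# that its completion becomes a new candidate utterance.

def build_augmentation_prompt(intent: str, seed_examples: list[str]) -> str:
    """Organize seed examples under a header, ending with an open
    numbered slot for the model to continue with a new utterance."""
    lines = [f"The following sentences belong to the intent '{intent}':"]
    for i, example in enumerate(seed_examples, start=1):
        lines.append(f"{i}. {example}")
    # Leave the next numbered slot empty so the model completes it.
    lines.append(f"{len(seed_examples) + 1}.")
    return "\n".join(lines)

# Toy seed utterances, invented for illustration.
seeds = [
    "My card top-up failed, what happened?",
    "Why didn't my top-up go through?",
]
prompt = build_augmentation_prompt("top_up_failed", seeds)
# `prompt` would then be sent to the completion endpoint of an
# off-the-shelf LLM, with the completion parsed as a new example.
```

Keeping the examples numbered and under an explicit intent header is one way to give the prompt the kind of structure that, in our experience, tends to improve generations.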
In our study, we started with 10 examples per intent (a.k.a. seed examples) and considered three scenarios:
We tested our method on four intent classification datasets of varying granularities: CLINC150, HWU64, Banking77, and SNIPS. And here are the results:
Figure 2: Intent classification performance of the BERT-large model.
Time to draw conclusions!
We don’t see as much of a “kick” for HWU64 and Banking77.
Overall, we see real promise from GPT, but there’s also a lingering question: why doesn’t GPT stand out more on the fine-grained HWU64 and Banking77 datasets?
The case of confounding intents
To get to the bottom of this mystery, we looked at the generated examples and found that GPT often gets confused between intents that are semantically close to each other, a.k.a. confounding intents. Consider the following figure.
Figure 3: The case of confounding intents
On the left, we see a table with Davinci2 generations, the respective seed intents they were generated for, and the labels assigned to them by the oracle model. Consider the last row of the table: Davinci generates “Did my master card top-up fail?” for “pending_top_up”, whereas it should belong to “top_up_failed”, as the oracle correctly suggests.
On the right, we see the label distribution3 of Davinci generations for “topping_up_by_card” and notice that only a third of the generated sentences belong to that class.
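The oracle relabeling behind this analysis can be sketched as follows. In the paper, the oracle is a model trained on the full labelled dataset; the word-overlap “oracle” and toy sentences below are crude stand-ins for illustration only:

```python
# Sketch of oracle relabeling: assign each GPT generation the label
# predicted by an "oracle" trained on the full dataset, then inspect
# the label distribution per seed intent. The nearest-neighbour
# word-overlap oracle below is a toy stand-in for a real classifier.
from collections import Counter

# Toy stand-in for the full labelled dataset.
full_data = [
    ("my top-up failed", "top_up_failed"),
    ("top-up did not go through", "top_up_failed"),
    ("is my top-up pending", "pending_top_up"),
    ("top-up still processing", "pending_top_up"),
    ("top up using my card", "topping_up_by_card"),
    ("can I top up by card", "topping_up_by_card"),
]

def oracle_label(sentence: str) -> str:
    """Label a sentence with the class of the training sentence that
    shares the most words with it -- a crude stand-in for a trained
    oracle classifier."""
    words = set(sentence.lower().split())
    best = max(full_data,
               key=lambda pair: len(words & set(pair[0].lower().split())))
    return best[1]

# Relabel generations produced for the seed intent "topping_up_by_card"
# and count how many the oracle actually assigns to that class.
generations = ["top up with my card", "my top-up failed again"]
distribution = Counter(oracle_label(g) for g in generations)
```

With a real oracle, the fraction of generations whose predicted label matches their seed intent is exactly the per-class statistic shown in the figure.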
It’s now clear that GPT has a hard time understanding these niche intents. But can GPT learn these minute differences? Our small-scale study suggests yes!
Here, we use GPT as a classifier: we first show it seed examples from three closely related intents and then ask it to predict a label for a generated sentence. We find that GPT is surprisingly good as a classifier, and it can even identify mismatched generations.
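A rough sketch of framing GPT as a few-shot classifier follows. The prompt format, intents, and seed utterances are illustrative assumptions, not the exact prompt from the paper:

```python
# Sketch: turn intent classification into a completion task by listing
# labelled seed utterances and ending with an unlabelled candidate.
# All intents and sentences below are invented for illustration.

def build_classification_prompt(labeled_seeds: dict[str, list[str]],
                                candidate: str) -> str:
    """List seed utterances with their intent labels, then leave the
    candidate's label blank for the model to fill in."""
    lines = []
    for intent, examples in labeled_seeds.items():
        for example in examples:
            lines.append(f'"{example}" -> {intent}')
    lines.append(f'"{candidate}" ->')
    return "\n".join(lines)

# Three closely related (confounding) intents, as in our study setup.
seeds = {
    "top_up_failed": ["Why did my top-up fail?"],
    "pending_top_up": ["Is my top-up still processing?"],
    "topping_up_by_card": ["Can I top up with my credit card?"],
}
prompt = build_classification_prompt(seeds, "Did my card top-up fail?")
# The model's completion after "->" is read off as the predicted label.
# A generation is kept only if this label matches its seed intent.
```

Rejecting generations whose predicted label disagrees with the seed intent is what yields the fidelity gains reported in Figure 4.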
Figure 4: Rejecting mismatched generations using GPT as a classifier provides significant gains in three-way fidelity. We also note that Davinci is even better than the 10-shot baseline classifier!
Long story short
Are you interested in using language models for Conversational AI?
If you are interested in exploring full-time career opportunities with the ServiceNow Research team or wish to learn more about the part-time Visiting Researcher Program (for research internships), please take a moment to fill out this form so that hiring managers can learn more about your background and we can contact you about our current openings.
Please note that ServiceNow Research internships through our Visiting Researcher Program start and run all year and are not limited to "seasonal" applications.
Follow @ServiceNowRSRCH on Twitter for our latest news and updates from the community, and to get in touch.
1 To simulate human relabeling, we train an oracle model on the full dataset and relabel the generations.
2 Davinci is the largest and the most powerful GPT-3 engine available (~175B parameters).
3 As suggested by the oracle model
© 2022 ServiceNow, Inc. All rights reserved. ServiceNow, the ServiceNow logo, Now, and other ServiceNow marks are trademarks and/or registered trademarks of ServiceNow, Inc. in the United States and/or other countries. Other company names, product names, and logos may be trademarks of the respective companies with which they are associated.