Data augmentation for intent classification with off-the-shelf large language models

Following our recent research published in the proceedings of the 4th Workshop on NLP for Conversational AI at ACL 2022, this blog post discusses our latest work exploring how best to leverage the power of Large Language Models to generate more training data. Gaurav Sahu is the first author of this paper and contributed this blog post. Marie-Ève Marchand and Sean Hughes also contributed as reviewers and editors of this article.

Bibtex:

@inproceedings{sahu-etal-2022-data,
title = "Data Augmentation for Intent Classification with Off-the-shelf Large Language Models",
author = "Sahu, Gaurav and Rodriguez, Pau and Laradji, Issam and Atighehchian, Parmida and Vazquez, David and Bahdanau, Dzmitry",
booktitle = "Proceedings of the 4th Workshop on NLP for Conversational AI",
month = may,
year = "2022",
address = "Dublin, Ireland",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2022.nlp4convai-1.5",
pages = "47--57",
}

In enterprise AI, there is a rising need to make chatbots more intelligent. One of the challenges is to increase their ability to classify an utterance into an intent. In other words, how can we improve the performance of a task-oriented chatbot at identifying what the user wants to accomplish? Training such intent classifiers is difficult because there is often not enough labelled data.
Off-the-shelf large language models like GPT-3 look very promising for generating training data, even for complex setups like intent classification.

We only need to ask the right question!

Method

Generating new training data with GPT is super-easy: show it your examples, and it will do the rest! All of this without the headache of fine-tuning custom large language model generators.

Figure 1: Prompt template fed to GPT. Note that the prompt is incomplete, and GPT generates good (blue) and bad (red) examples.

The art of crafting a good prompt is somewhat mystical, and in all fairness, we didn’t land on ours on the first try. However, this lore deserves an article of its own. For now, note that the more organized the prompt, the better the generations.
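To make the idea of an "incomplete" prompt concrete, here is a minimal sketch. The wording of the template is an assumption for illustration (the actual template is the one shown in Figure 1), and `build_prompt` and `parse_generations` are hypothetical helper names, not functions from our codebase. The key point is that the prompt ends on an open slot, so the model continues the list with new utterances.

```python
def build_prompt(intent, seed_examples):
    """Build an intentionally incomplete prompt for one intent."""
    # Hypothetical wording; the real template is the one in Figure 1.
    lines = [f"The following sentences belong to the same category '{intent}':"]
    for i, example in enumerate(seed_examples, start=1):
        lines.append(f"Example {i}: {example}")
    # Leave the next slot open so the model continues the list.
    lines.append(f"Example {len(seed_examples) + 1}:")
    return "\n".join(lines)


def parse_generations(completion):
    """Split a raw completion into clean, de-duplicated candidate utterances."""
    candidates = []
    for line in completion.splitlines():
        text = line.strip()
        # Drop a leading "Example N:" prefix if the model produced one.
        if text.lower().startswith("example") and ":" in text:
            text = text.split(":", 1)[1].strip()
        if text:
            candidates.append(text)
    # dict.fromkeys removes duplicates while preserving order.
    return list(dict.fromkeys(candidates))
```

The string returned by `build_prompt` would be sent to whichever completion endpoint you use, and the raw text that comes back can be passed through `parse_generations` before any labeling or filtering.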

In our study, we started with 10 examples per intent (the seed examples) and considered three scenarios:

Baseline: How good is the intent classifier if trained only on the seed examples?
Augmented: Does adding GPT generations improve the baseline classifier?
Augmented + Relabeled: How much would human relabeling of generations help? ¹
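The three scenarios differ only in which training set the classifier sees. The sketch below shows one way to assemble them; it is an illustration under assumptions, not our actual pipeline. `build_training_sets` is a hypothetical name, and `oracle` is any callable that maps an utterance to an intent label (in our study, a model trained on the full dataset stands in for a human annotator).

```python
def build_training_sets(seeds, generations, oracle):
    """Return the training set for each of the three scenarios.

    seeds / generations: dict mapping intent name -> list of utterances.
    oracle: callable mapping an utterance to an intent label
            (hypothetical interface, a stand-in for human relabeling).
    """
    baseline = [(text, intent) for intent, texts in seeds.items() for text in texts]
    # Augmented: trust the intent each sentence was generated for.
    generated = [(text, intent) for intent, texts in generations.items() for text in texts]
    # Augmented + Relabeled: let the oracle overwrite the label.
    relabeled = [(text, oracle(text)) for text, _ in generated]
    return {
        "baseline": baseline,
        "augmented": baseline + generated,
        "augmented_relabeled": baseline + relabeled,
    }
```

Each of the three lists can then be fed to the same classifier training routine, so that any difference in accuracy comes from the data alone.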

We tested our method on four intent classification datasets of varying granularities: CLINC150, HWU64, Banking77, and SNIPS. And here are the results:

Figure 2: Intent classification performance of the BERT-large model.

Time to draw conclusions!

The good:

GPT rocks! GPT-generations immediately boost the baseline 10-shot classifier.
Relabeling further helps. We observe that GPT beats EDA (Easy Data Augmentation, a rule-based augmentation baseline) in the Augmented + Relabeled scenario, which showcases the higher utility of GPT generations.

The bad:

We don’t see as much of a “kick” for HWU64 and Banking77.

Overall, we see real promise in GPT, but there’s also a lingering question: why doesn’t GPT stand out more on the fine-grained HWU64 and Banking77 datasets?

The case of confounding intents

To solve the mystery, we looked at the generated examples and found that GPT often confuses intents that are semantically close to each other, also known as confounding intents. Consider the following figure.

Figure 3: The case of confounding intents

On the left, we see a table with Davinci² generations, the seed intents they were generated for, and the labels assigned to them by the oracle model. Consider the last row of the table. Davinci generates “Did my master card top-up fail?” for “pending_top_up”, whereas it should belong to “top_up_failed”, as the oracle correctly suggests.

On the right, we see the label distribution³ of Davinci generations for “topping_up_by_card” and notice that only a third of the generated sentences belong to that class.
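For readers who want to run the same diagnostic on their own generations, a label distribution like the one described above is just a normalized count of oracle labels. This is a minimal sketch with hypothetical function names; the oracle labels are assumed to have been computed already.

```python
from collections import Counter


def label_distribution(oracle_labels):
    """Map each oracle label to its share of the generated sentences."""
    counts = Counter(oracle_labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    # most_common() orders labels from most to least frequent.
    return {label: n / total for label, n in counts.most_common()}


def share_matching_seed(seed_intent, oracle_labels):
    """Fraction of generations that the oracle assigns to the seed intent."""
    return label_distribution(oracle_labels).get(seed_intent, 0.0)
```

A low value from `share_matching_seed` for an intent is a sign that its generations are drifting into neighbouring intents and deserve a closer look.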

It’s now clear that GPT has a hard time understanding these niche intents, but can it learn these minute differences? Our small-scale study suggests yes!

Here, we use GPT as a classifier: we first show it seed examples from three closely related intents and then ask it to predict a label for a generated sentence. We find that GPT is surprisingly good as a classifier, and it can even identify mismatched generations.
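The filtering step can be sketched as follows. The prompt wording is an assumption for illustration, and `complete` is a hypothetical callable that sends a prompt to a language model and returns its text completion; plug in whichever client you use. A generation is kept only if the predicted label matches the intent it was generated for.

```python
def build_classifier_prompt(seeds_by_intent, sentence):
    """Few-shot prompt over closely related intents, ending on an open label slot."""
    # Hypothetical wording, not the exact prompt used in the study.
    header = "Each sentence below belongs to one of these categories: "
    lines = [header + ", ".join(seeds_by_intent) + "."]
    for intent, examples in seeds_by_intent.items():
        for example in examples:
            lines.append(f"sentence: {example}\ncategory: {intent}")
    # The model is expected to fill in the category of the last sentence.
    lines.append(f"sentence: {sentence}\ncategory:")
    return "\n".join(lines)


def filter_generations(generations, seed_intent, seeds_by_intent, complete):
    """Keep only generations whose predicted label matches the seed intent.

    complete: callable(prompt) -> str, a stand-in for a call to the model.
    """
    kept = []
    for sentence in generations:
        prompt = build_classifier_prompt(seeds_by_intent, sentence)
        predicted = complete(prompt).strip()
        if predicted == seed_intent:
            kept.append(sentence)
    return kept
```

Exact string matching on the predicted label is the simplest possible check; in practice you may want to normalize case or map free-form answers back to the known intent names.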

Figure 4: Rejecting mismatched generations using GPT as a classifier provides significant gains in three-way fidelity. We also note that Davinci is even better than the 10-shot baseline classifier!

Long story short

GPT can generally generate valuable data for an intent by looking at just a few examples.
However, it struggles to distinguish between closely related intents (e.g., “Pending top-up” vs. “Failed top-up”).
But it can learn that difference by seeing examples from closely related intents!

Here's the supporting codebase link and a quick video overview: https://github.com/ElementAI/data-augmentation-with-llms/

Are you interested in using language models for Conversational AI?

If you are interested in exploring full-time career opportunities with the ServiceNow Research team or wish to learn more about the part-time Visiting Researcher Program (for research internships), please take a moment to fill out this form so that hiring managers can learn more about your background and we can contact you about our current openings.

Please note that ServiceNow Research internships through our Visiting Researcher Program start and run all year and are not limited to "seasonal" applications.

Follow @ServiceNowRSRCH on Twitter for our latest news and updates from the community, and to get in touch.

¹ To simulate human relabeling, we train an oracle model on the full dataset and use it to relabel the generations.

² Davinci is the largest and the most powerful GPT-3 engine available (~175B parameters).

³ As suggested by the oracle model.