PromptMix: A class boundary augmentation method for LLM distillation
Authors: Gaurav Sahu, Olga Vechtomova, Dzmitry Bahdanau, and Issam H. Laradji
In the world of text classification, data is the unsung hero. However, the availability of high-quality, annotated datasets is a luxury we don't often have. This scarcity can stifle the performance of machine learning models, especially in specialized domains.
To combat this, we've developed PromptMix, a novel data augmentation technique that leverages the prowess of large language models (LLMs) such as GPT-3. Our strategy? Generate examples that straddle the class boundaries—those tricky gray areas that often make or break a model's accuracy.
Why class boundaries?
Examples near class boundaries are critical: they teach models to discern nuances and make finer distinctions. However, they come with a risk: the potential to introduce noisy, incorrect examples. That's where the second step of PromptMix, relabeling, comes in. We relabel the augmented examples using a prompt-based LLM classifier to maintain label integrity.
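To make that concrete, here is a minimal sketch of what such a prompt-based relabeling classifier could look like. The `complete(prompt)` helper is a hypothetical stand-in for one call to an instruction-following LLM such as GPT-3.5, and the prompt wording is illustrative; the paper's exact relabeling template appears in its Figure 4.

```python
# Minimal sketch of prompt-based relabeling. `complete(prompt)` is a
# hypothetical helper that sends the prompt to an LLM (e.g., GPT-3.5)
# and returns the text of one completion; it is not the paper's code.

def relabel(sentence: str, class_descriptions: dict) -> str:
    """Ask the LLM which class best describes a generated sentence."""
    described = "\n".join(
        f"- {name}: {desc}" for name, desc in class_descriptions.items()
    )
    prompt = (
        "Consider the following classes and their descriptions:\n"
        f"{described}\n\n"
        f"Sentence: {sentence}\n"
        "Which single class best describes the sentence? "
        "Answer with the class name only."
    )
    answer = complete(prompt).strip().lower()
    # Map the free-form answer back onto the known label space.
    for name in class_descriptions:
        if name.lower() in answer:
            return name
    return answer  # unrecognized labels can be filtered out downstream
```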
PromptMix is a two-step approach:
- First, we generate borderline examples using the prompt template (see Figure 1; a sketch of this step follows the figure caption below).
- Second, we correct for leakage: GPT sometimes generates an example that is more faithful to the secondary class in the mix (the class with the smaller mixing percentage) than to the intended one. To fix this, we relabel the generated sentences using GPT. The relabeling template closely mirrors Figure 1 (see Figure 4 of the paper for a detailed demonstration).
Figure 1: The PromptMix template consists of two parts: Part 1 contains the descriptions of the two classes to mix, plus examples if available; Part 2 instructs GPT to generate sentences that are a mixture of the two classes.
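Below is a minimal sketch of the generation step under the same assumptions: the prompt text paraphrases the two-part template in Figure 1 rather than copying it verbatim, `complete` is the hypothetical LLM helper from the relabeling sketch, and `alpha` sets how strongly the generated sentence should lean toward the main class.

```python
# Minimal sketch of borderline-example generation. The wording paraphrases
# the two-part template in Figure 1; `complete` is the same hypothetical
# LLM helper used in the relabeling sketch above.

def borderline_prompt(main: str, other: str, class_descriptions: dict,
                      seed_examples: dict = None, alpha: float = 0.75) -> str:
    # Part 1: describe the two classes to mix, with seed examples if available.
    parts = ["Consider the following classes:"]
    for name in (main, other):
        parts.append(f"- {name}: {class_descriptions[name]}")
        for ex in (seed_examples or {}).get(name, []):
            parts.append(f"  Example: {ex}")
    # Part 2: instruct the LLM to generate a mixture of the two classes.
    parts.append(
        f"\nWrite a sentence that is {int(alpha * 100)}% about '{main}' "
        f"and {int((1 - alpha) * 100)}% about '{other}'."
    )
    return "\n".join(parts)

# Usage: generate one borderline example, then immediately relabel it,
# since the mixture may drift closer to the secondary class.
# `class_descriptions` is an assumed mapping from class names
# (e.g., "age_limit", "atm_support") to short natural-language descriptions.
sentence = complete(borderline_prompt("age_limit", "atm_support",
                                      class_descriptions))
label = relabel(sentence, class_descriptions)
```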
We experiment on four benchmark text classification datasets in aggressive zero-shot and two-shot settings (see Figure 2).
Figure 2: The table shows the performance of classifiers trained on data generated by PromptMix and other baselines (we used GPT-3.5 Turbo in our experiments).
Results:
- In the table, A1 denotes the test accuracy of a classifier trained on the generated data without relabeling, and A2 denotes the test accuracy of a classifier trained on the same data after relabeling by GPT. The clear jump from A1 to A2 shows the importance of the relabeling step in our pipeline.
- Next, the performance of PromptMix (the A2 column) comes quite close to the NN+GPT3.5 baseline, i.e., the classification performance of GPT-3.5 itself on the respective datasets. This shows that generating borderline examples helps distill the knowledge of an LLM like GPT-3.5 into much smaller language models such as DistilBERT and BERT (a fine-tuning sketch follows this list).
- Compared to other data augmentation baselines, PromptMix yields significantly better performance, even though competitive baselines such as LINDA and GPT3Mix use many more labeled examples than PromptMix.
- We also note that just adding class descriptions leads to decent performance gains on all the datasets.
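As a sketch of that distillation step, the snippet below fine-tunes DistilBERT on the augmented, relabeled data with the Hugging Face Trainer. Here `texts`, `labels`, and `num_classes` stand in for the dataset produced by the steps above, and the hyperparameters are illustrative, not the paper's.

```python
# Minimal sketch of the distillation step: fine-tune DistilBERT on the
# augmented, relabeled data. `texts` (list of str), `labels` (list of int),
# and `num_classes` are assumed to come from the generation steps above.

import torch
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=num_classes)

class AugmentedDataset(torch.utils.data.Dataset):
    """Wraps the generated sentences and their (relabeled) classes."""
    def __init__(self, texts, labels):
        self.encodings = tokenizer(texts, truncation=True, padding=True)
        self.labels = labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="promptmix-distilbert",
                           num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=AugmentedDataset(texts, labels),
)
trainer.train()
```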
Figure 3: The table shows qualitative examples from the input classes that illustrate the effectiveness of the mixup strategy. We've highlighted parts about age_limit in yellow and parts about atm_support in blue. The evidence of mixup at work is clear: sentences generated with mixup contain information about both classes.
To conclude, we propose PromptMix, an effective data augmentation strategy for extremely low-data settings. PromptMix combines text augmentation, pseudo-labeling, and knowledge distillation into a single, cohesive pipeline. Here's a link to the supporting codebase: https://github.com/ServiceNow/PromptMix-EMNLP-2023.
Find out more about ServiceNow Research.