LLM2Vec: Large language models are secretly powerful text encoders
Authors: Parishad BehnamGhader, Vaibhav Adlakha, Marius Mosbach, Dzmitry Bahdanau, Nicolas Chapados, Siva Reddy
Text-embedding models convert a piece of text, such as a search query, document, or piece of code, into a fixed-size vector of real numbers, called an embedding. Given such embeddings, we can measure the similarity, or relatedness, of pieces of text. This facilitates various important applications, such as search, clustering, retrieval, and classification.
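As a toy illustration (the vectors and dimensionality below are made up; real embedding models produce vectors with hundreds or thousands of dimensions), relatedness between two embeddings is commonly scored with cosine similarity:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity: close to 1.0 means very similar, close to 0.0 means unrelated.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dimensional embeddings of a query and a document.
query_emb = np.array([0.1, 0.8, 0.3, 0.2])
doc_emb = np.array([0.2, 0.7, 0.4, 0.1])
print(cosine_similarity(query_emb, doc_emb))
```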
With the widespread availability of decoder-only large language models (LLMs), such as GPT-4, LLaMA2, Mistral-7B, and StarCoder2, a pressing question in the natural language processing (NLP) research community is how best to use these models to construct powerful text embeddings.
We’re excited to present LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders, a simple and efficient approach that transforms any decoder-only LLM into a powerful text encoder in an unsupervised fashion, using only lightweight adapters (LoRA) and without modifying the base model’s weights.
Below we give an overview of the key components of LLM2Vec and present the exciting results we obtained when benchmarking LLM2Vec models on the challenging Massive Text Embedding Benchmark (MTEB). Our LLM2Vec-Mistral model ranks first on the MTEB leaderboard in the unsupervised category, first in the supervised category among models trained only on publicly available embedding data (E5), and seventh on the overall leaderboard (the six models ranked above it are trained on synthetic data generated by GPT-4 or models of similar scale).
A simple and efficient recipe
At its core, LLM2Vec consists of three simple steps:
- Enabling bidirectional attention
- Adaptation via masked next-token prediction (MNTP)
- Adaptation via unsupervised contrastive learning
Adapting a model with the LLM2Vec approach is highly efficient and works with parameter-efficient fine-tuning methods such as LoRA. Additionally, the adaptation can be performed using a general domain corpus such as Wikipedia, requires only a few hundred training steps, and can be run on a single GPU.
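As a rough illustration, here is a conceptual sketch of how the recipe fits together. The base model, masking probability, and LoRA hyperparameters below are illustrative, and enabling bidirectional attention is only indicated in a comment, since in practice it is done by patching the decoder’s attention layers:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, DataCollatorForLanguageModeling
from peft import LoraConfig, get_peft_model

base_model_name = "mistralai/Mistral-7B-Instruct-v0.2"  # any decoder-only LLM
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
model = AutoModelForCausalLM.from_pretrained(base_model_name, torch_dtype=torch.bfloat16)

# Step 1: enable bidirectional attention.
# (Implemented by replacing the causal attention mask with a full mask inside the
# decoder's attention layers; omitted here for brevity.)

# Step 2: masked next-token prediction (MNTP).
# Decoder-only tokenizers have no mask token, so a placeholder token stands in for it.
tokenizer.mask_token = "_"
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.2)

# Adaptation is parameter-efficient: only small LoRA adapters are trained.
lora_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, bias="none", task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# Step 3: unsupervised contrastive learning (SimCSE) trains a second LoRA adapter on
# top of the merged MNTP model, using dropout to produce two views of the same sentence.
```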
State-of-the-art performance
LLM2Vec is not only simple and efficient, it also leads to state-of-the-art performance on the challenging MTEB, in both the unsupervised and the supervised setting (among models trained only on publicly available data).
Unsupervised results
We applied LLM2Vec to some of the best-performing LLMs available and evaluated the resulting text-embedding models on MTEB. In the unsupervised setting (i.e., without using any labeled training data for contrastive learning), our LLM2Vec-transformed models achieved a new state-of-the-art score of 56.80, outperforming the previous best unsupervised approach by a large margin.
Supervised results
LLM2Vec can also be easily combined with supervised contrastive learning. As our results show, applying LLM2Vec before supervised contrastive learning leads to a substantial improvement. Moreover, LLM2Vec in combination with Mistral-7B, currently the best-performing 7 billion-parameter LLM, leads to a new state-of-the-art performance of 64.80 on MTEB among models trained only with publicly available data.
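For readers unfamiliar with the objective, supervised contrastive training for embeddings typically uses an in-batch-negatives (InfoNCE-style) loss over (query, positive passage) pairs. The sketch below is a generic version of this idea, not the exact loss or temperature used in our training code:

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(q_emb: torch.Tensor, p_emb: torch.Tensor, temperature: float = 0.05) -> torch.Tensor:
    # q_emb, p_emb: (batch, dim) embeddings of queries and their positive passages.
    # Every other passage in the batch acts as a negative for a given query.
    q = F.normalize(q_emb, dim=-1)
    p = F.normalize(p_emb, dim=-1)
    logits = q @ p.T / temperature                      # (batch, batch) similarity matrix
    labels = torch.arange(q.size(0), device=q.device)   # positives lie on the diagonal
    return F.cross_entropy(logits, labels)
```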
Highly sample-efficient
LLM2Vec-transformed models require less training data to perform well than models trained without the LLM2Vec transformation.
These results make us particularly excited about challenging real-world scenarios where large amounts of labeled data might be costly to acquire.
Use it on your own data
We’ve made it easy for you to use our LLM2Vec-transformed models. The LLM2Vec class is a wrapper on top of Hugging Face models that supports sequence encoding and pooling operations. The steps below show an example of how to use the library.
Preparing the model
Here, we first initialize the model and apply the MNTP-trained LoRA weights on top. After merging the MNTP weights into the base model, we can do either of the following (see the code sketch after this list):
- Load the unsupervised-trained LoRA weights (trained with the SimCSE objective on a Wikipedia corpus), or
- Load the supervised-trained LoRA weights (trained with supervised contrastive learning on the public E5 data)
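A minimal sketch of this step is shown below, assuming the Mistral-based checkpoints released with the paper (please check the exact model identifiers on our Hugging Face page):

```python
import torch
from transformers import AutoTokenizer, AutoModel, AutoConfig
from peft import PeftModel

mntp_name = "McGill-NLP/LLM2Vec-Mistral-7B-Instruct-v2-mntp"  # illustrative identifier

tokenizer = AutoTokenizer.from_pretrained(mntp_name)
config = AutoConfig.from_pretrained(mntp_name, trust_remote_code=True)
model = AutoModel.from_pretrained(
    mntp_name,
    trust_remote_code=True,
    config=config,
    torch_dtype=torch.bfloat16,
    device_map="cuda" if torch.cuda.is_available() else "cpu",
)

# Apply the MNTP-trained LoRA weights and merge them into the base model.
model = PeftModel.from_pretrained(model, mntp_name)
model = model.merge_and_unload()

# Option 1: unsupervised LoRA weights (SimCSE objective, Wikipedia corpus).
model = PeftModel.from_pretrained(model, mntp_name + "-unsup-simcse")
# Option 2: supervised LoRA weights (contrastive learning, public E5 data).
# model = PeftModel.from_pretrained(model, mntp_name + "-supervised")
```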
Applying LLM2Vec wrapper
Then, we define our LLM2Vec encoder model as follows:
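Continuing from the snippet above (pooling_mode and max_length are example settings):

```python
from llm2vec import LLM2Vec

# Wrap the prepared model and tokenizer; embeddings are mean-pooled over tokens.
l2v = LLM2Vec(model, tokenizer, pooling_mode="mean", max_length=512)
```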
Inference
This model now returns a text embedding for any input given in the form [[instruction1, text1], [instruction2, text2]] or [text1, text2]. During training, we provide instructions for both sentences in symmetric tasks and only for the queries in asymmetric tasks.
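Continuing the example above, here is a sketch of encoding queries (with an instruction) and documents (without one) for a retrieval task and scoring them with cosine similarity; the instruction text and documents are illustrative:

```python
import torch.nn.functional as F

# Asymmetric task (retrieval): instructions are provided for queries only.
instruction = "Given a web search query, retrieve relevant passages that answer the query:"
queries = [
    [instruction, "how much protein should a female eat"],
    [instruction, "summit define"],
]
q_reps = l2v.encode(queries)

# Documents are encoded without an instruction.
documents = [
    "The recommended daily protein intake for an average adult woman is about 46 grams.",
    "A summit is the highest point of a hill or mountain.",
]
d_reps = l2v.encode(documents)

# Cosine similarity between every query and every document.
q_norm = F.normalize(q_reps, p=2, dim=1)
d_norm = F.normalize(d_reps, p=2, dim=1)
print(q_norm @ d_norm.T)
```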
Summary
As demonstrated above, LLM2Vec is a simple unsupervised approach that can transform any pretrained decoder-only LLM into a strong text encoder. If you’re as excited about LLM2Vec as we are, check out our hands-on tutorial, which walks you through the different steps of our method. We also welcome contributions on GitHub and invite the community to share their LLM2Vec-transformed models.
Research: Project page
Code: LLM2Vec on GitHub
Tutorial: Learn how to apply LLM2Vec to LLaMA-2
Find out more about ServiceNow Research.