LLM2Vec: Large language models are secretly powerful text encoders
Authors: Parishad BehnamGhader, Vaibhav Adlakha, Marius Mosbach, Dzmitry Bahdanau, Nicolas Chapados, Siva Reddy
Text-embedding models convert a piece of text, such as a search query, document, or piece of code, into a fixed-size vector of real numbers, called an embedding. Given such embeddings, we can measure the similarity, or relatedness, of pieces of text. This facilitates various important applications, such as search, clustering, retrieval, and classification.
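As a toy illustration (the vectors and dimensionality below are made up; real embedding models produce vectors with hundreds or thousands of dimensions), relatedness between two embeddings is commonly scored with cosine similarity:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity: close to 1.0 means very similar, close to 0.0 means unrelated.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dimensional embeddings of a query and a document.
query_emb = np.array([0.1, 0.8, 0.3, 0.2])
doc_emb = np.array([0.2, 0.7, 0.4, 0.1])
print(cosine_similarity(query_emb, doc_emb))
```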
With the widespread availability of decoder-only large language models (LLMs), such as GPT-4, LLaMA2, Mistral-7B, and StarCoder2, a pressing question in the natural language processing (NLP) research community is how best to use these models to construct powerful text embeddings.
We’re excited to present LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders, a simple and efficient approach that transforms any decoder-only LLM into a powerful text encoder in an unsupervised fashion, using only lightweight adapters (LoRA) and without modifying the base model’s weights.
Below we give an overview of the key components of LLM2Vec and present the exciting results we obtained when benchmarking LLM2Vec models on the challenging Massive Text Embedding Benchmark (MTEB). Our LLM2Vec-Mistral model ranks first on the MTEB leaderboard in the unsupervised category, first in the supervised category among models trained only on publicly available embedding data (E5), and seventh on the overall leaderboard (the six models ranked above it are trained on synthetic data generated by GPT-4 or models of similar scale).
A simple and efficient recipe
At its core, LLM2Vec consists of three simple steps:
- Enabling bidirectional attention
- Adaptation via masked next-token prediction (MNTP)
- Adaptation via unsupervised contrastive learning
Adapting a model with the LLM2Vec approach is highly efficient and works with parameter-efficient fine-tuning methods such as LoRA. Additionally, the adaptation can be performed using a general domain corpus such as Wikipedia, requires only a few hundred training steps, and can be run on a single GPU.
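As a rough illustration, here is a conceptual sketch of how the recipe fits together. The base model, masking probability, and LoRA hyperparameters below are illustrative, and enabling bidirectional attention is only indicated in a comment, since in practice it is done by patching the decoder’s attention layers:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, DataCollatorForLanguageModeling
from peft import LoraConfig, get_peft_model

base_model_name = "mistralai/Mistral-7B-Instruct-v0.2"  # any decoder-only LLM
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
model = AutoModelForCausalLM.from_pretrained(base_model_name, torch_dtype=torch.bfloat16)

# Step 1: enable bidirectional attention.
# (Implemented by replacing the causal attention mask with a full mask inside the
# decoder's attention layers; omitted here for brevity.)

# Step 2: masked next-token prediction (MNTP).
# Decoder-only tokenizers have no mask token, so a placeholder token stands in for it.
tokenizer.mask_token = "_"
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.2)

# Adaptation is parameter-efficient: only small LoRA adapters are trained.
lora_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, bias="none", task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# Step 3: unsupervised contrastive learning (SimCSE) trains a second LoRA adapter on
# top of the merged MNTP model, using dropout to produce two views of the same sentence.
```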
State-of-the-art performance
LLM2Vec is not only simple and efficient, it also leads to state-of-the-art performance on the challenging MTEB, in both the unsupervised and the supervised setting (among models trained only on publicly available data).
Unsupervised results
We applied LLM2Vec to some of the best-performing LLMs available and evaluated the resulting text-embedding models on MTEB. In the unsupervised setting (i.e., without using any labeled training data for contrastive learning), our LLM2Vec-transformed models achieved a new state-of-the-art score of 56.80, outperforming the previous best unsupervised approach by a large margin.
Supervised results
LLM2Vec can also be easily combined with supervised contrastive learning. As our results show, applying LLM2Vec before supervised contrastive learning leads to a substantial improvement. Moreover, LLM2Vec in combination with Mistral-7B, currently the best-performing 7 billion-parameter LLM, leads to a new state-of-the-art performance of 64.80 on MTEB among models trained only with publicly available data.
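For readers unfamiliar with the objective, supervised contrastive training for embeddings typically uses an in-batch-negatives (InfoNCE-style) loss over (query, positive passage) pairs. The sketch below is a generic version of this idea, not the exact loss or temperature used in our training code:

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(q_emb: torch.Tensor, p_emb: torch.Tensor, temperature: float = 0.05) -> torch.Tensor:
    # q_emb, p_emb: (batch, dim) embeddings of queries and their positive passages.
    # Every other passage in the batch acts as a negative for a given query.
    q = F.normalize(q_emb, dim=-1)
    p = F.normalize(p_emb, dim=-1)
    logits = q @ p.T / temperature                      # (batch, batch) similarity matrix
    labels = torch.arange(q.size(0), device=q.device)   # positives lie on the diagonal
    return F.cross_entropy(logits, labels)
```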
Highly sample-efficient
LLM2Vec-transformed models require less training data to perform well than models trained without the LLM2Vec transformation.
These results make us particularly excited about challenging real-world scenarios where large amounts of labeled data might be costly to acquire.
Use it on your own data
We’ve made it easy for you to use our LLM2Vec-transformed models. The LLM2Vec class is a wrapper on top of Hugging Face models that supports sequence encoding and pooling operations. The steps below show an example of how to use the library.
Preparing the model
Here, we first initialize the model and apply the MNTP-trained LoRA weights on top. After merging the MNTP weights into the base model, we can do either of the following (see the code sketch after this list):
- Load the unsupervised-trained LoRA weights (trained with the SimCSE objective on a Wikipedia corpus), or
- Load the supervised-trained LoRA weights (trained with supervised contrastive learning on the public E5 data)
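A minimal sketch of this step is shown below, assuming the Mistral-based checkpoints released with the paper (please check the exact model identifiers on our Hugging Face page):

```python
import torch
from transformers import AutoTokenizer, AutoModel, AutoConfig
from peft import PeftModel

mntp_name = "McGill-NLP/LLM2Vec-Mistral-7B-Instruct-v2-mntp"  # illustrative identifier

tokenizer = AutoTokenizer.from_pretrained(mntp_name)
config = AutoConfig.from_pretrained(mntp_name, trust_remote_code=True)
model = AutoModel.from_pretrained(
    mntp_name,
    trust_remote_code=True,
    config=config,
    torch_dtype=torch.bfloat16,
    device_map="cuda" if torch.cuda.is_available() else "cpu",
)

# Apply the MNTP-trained LoRA weights and merge them into the base model.
model = PeftModel.from_pretrained(model, mntp_name)
model = model.merge_and_unload()

# Option 1: unsupervised LoRA weights (SimCSE objective, Wikipedia corpus).
model = PeftModel.from_pretrained(model, mntp_name + "-unsup-simcse")
# Option 2: supervised LoRA weights (contrastive learning, public E5 data).
# model = PeftModel.from_pretrained(model, mntp_name + "-supervised")
```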
Applying LLM2Vec wrapper
Then, we define our LLM2Vec encoder model as follows:
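Continuing from the snippet above (pooling_mode and max_length are example settings):

```python
from llm2vec import LLM2Vec

# Wrap the prepared model and tokenizer; embeddings are mean-pooled over tokens.
l2v = LLM2Vec(model, tokenizer, pooling_mode="mean", max_length=512)
```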
Inference
This model now returns a text embedding for any input given in the form [[instruction1, text1], [instruction2, text2]] or [text1, text2]. During training, we provide instructions for both sentences in symmetric tasks and only for the queries in asymmetric tasks.
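Continuing the example above, here is a sketch of encoding queries (with an instruction) and documents (without one) for a retrieval task and scoring them with cosine similarity; the instruction text and documents are illustrative:

```python
import torch.nn.functional as F

# Asymmetric task (retrieval): instructions are provided for queries only.
instruction = "Given a web search query, retrieve relevant passages that answer the query:"
queries = [
    [instruction, "how much protein should a female eat"],
    [instruction, "summit define"],
]
q_reps = l2v.encode(queries)

# Documents are encoded without an instruction.
documents = [
    "The recommended daily protein intake for an average adult woman is about 46 grams.",
    "A summit is the highest point of a hill or mountain.",
]
d_reps = l2v.encode(documents)

# Cosine similarity between every query and every document.
q_norm = F.normalize(q_reps, p=2, dim=1)
d_norm = F.normalize(d_reps, p=2, dim=1)
print(q_norm @ d_norm.T)
```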
Summary
As demonstrated above, LLM2Vec is a simple unsupervised approach that can transform any pretrained decoder-only LLM into a strong text encoder. If you’re as excited about LLM2Vec as we are, check out our hands-on tutorial, which walks you through the different steps of our method. We also welcome contributions on GitHub and invite the community to share their LLM2Vec-transformed models.
Research: Project page
Code: LLM2Vec on GitHub
Tutorial: Learn how to apply LLM2Vec to LLaMA-2
Find out more about ServiceNow Research.