RepLiQA: A more robust benchmark for QA
Authors: João Monteiro, Pierre-André Noël, Étienne Marcotte, Sai Rajeswar, Valentina Zantedeschi, David Vázquez, Nicolas Chapados, Christopher Pal, and Perouz Taslakian
When large language models (LLMs) such as ChatGPT answer questions, they often do so based on a reference document, such as a news article, a recipe, or a specific company policy.
In many cases, we need answers to rely on specific documents, some of which are publicly available and widely known, while others are difficult to access or private. For example, it's easy to look up online who proved Fermat's Last Theorem (it's Andrew Wiles, of course), and LLMs can readily answer such questions.
What about questions regarding more specific or privileged information, such as the parental leave policy of a particular company? It remains unclear how well LLMs can handle these scenarios.
Evaluating LLMs on questions about novel, unseen content, where the topic is unknown to the model, reveals whether their performance reflects genuine reading comprehension or memorized knowledge. This performance can be contrasted with how well the same models answer questions based on general knowledge that's freely available on the internet.
Meet RepLiQA, a dataset we designed to help evaluate models’ capability to process unseen content. RepLiQA is a collection of reference documents, each associated with multiple question-answer pairs, where each answer is based on the content of the corresponding document.
These documents feature imaginative scenarios created by human content writers, making the content truly unique and thus not seen by models at training time. This setup helps us evaluate how effectively AI models can handle novel information.
How do well-known models perform on RepLiQA?
Figure 1 illustrates the performance of various popular LLMs on our RepLiQA dataset and on TriviaQA, a dataset containing questions about widely known facts. For both datasets, each model was given the question along with the reference document and was asked to extract the answer from the information in the document.
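To make this with-context setup concrete, here is a minimal sketch in Python of how such a prompt might be assembled. The prompt wording and the helper name are illustrative assumptions, not the exact protocol used in the paper.

# Minimal sketch of the with-context QA setup described above. The prompt
# wording is an illustrative assumption, not the exact prompt from the paper.

def build_prompt_with_context(document: str, question: str) -> str:
    """Build a prompt asking the model to answer strictly from the document."""
    return (
        "Answer the question using only the information in the document below. "
        "If the document does not contain the answer, say that it cannot be answered.\n\n"
        f"Document:\n{document}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

# Toy usage with an invented document and question.
toy_document = "The Windmere Glass Festival takes place every March in the coastal town of Windmere."
print(build_prompt_with_context(toy_document, "When does the Windmere Glass Festival take place?"))

The model's completion can then be compared against the reference answer using the evaluation metric of choice.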
Why is RepLiQA more difficult?
The answer to this question is somewhat nuanced, especially due to the lack of transparency in the training processes of many LLMs. Nevertheless, the question raises important considerations for how we evaluate models using public datasets.
How do we explain this performance gap? Most popular LLMs have been pretrained or fine-tuned on data available online. For datasets such as TriviaQA, which have lived on the internet for a long time, we cannot rule out the possibility that models were trained on the dataset itself, including its evaluation splits.
Moreover, answers to questions about well-known facts often rely on widely accessible sources, such as Wikipedia. It's therefore reasonable to assume that most LLMs have been exposed to at least some TriviaQA content and may simply be recalling answers they memorized during training.
Such dataset-leakage and information-contamination considerations would explain why models seem to perform well on public datasets such as TriviaQA but fail to generalize to new content. Further investigation supports this explanation: even when no reference document is given to the model, the performance of most LLMs on TriviaQA remains virtually unchanged.
Our experimental results on this question are shown in Figure 2. We prompted models with just a question and no reference document. In this no-context scenario, models perform poorly on RepLiQA but show only an insignificant drop when tested on TriviaQA.
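The no-context variant simply drops the document from the prompt, so any correct answer must come from what the model memorized during training. Again, this is a sketch under assumed prompt wording rather than the paper's exact setup.

# Sketch of the no-context (closed-book) variant: the same question, no document.

def build_prompt_no_context(question: str) -> str:
    """Build a prompt that forces the model to rely on its own memorized knowledge."""
    return f"Question: {question}\nAnswer:"

# Comparing accuracy with and without the document indicates how much of a
# model's performance comes from reading versus memorization.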
These results confirm our claim that most facts and entities in RepLiQA are novel and were not part of the pretraining data of any of the evaluated models.
The conclusion here is that evaluating a model on datasets such as TriviaQA is insufficient, and charts showing high performance may be misleading, as one cannot say with certainty whether good performance reflects acquired reading skills or mere memorization.
Thus, we need complementary evaluations to assess whether a model's performance would persist on new reference documents covering questions whose answers are not readily available online, such as, "What is my company's parental leave policy?"
How was RepLiQA created?
RepLiQA is a reading comprehension and question-answering dataset consisting of synthetic documents, each containing approximately 1,000 words and accompanied by five question-answer pairs. Each answer can either be located in the text of the associated document, or it explicitly notes that the question cannot be answered using the document.
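The sketch below illustrates one way to sanity-check this property on a loaded split, namely that each answer is either findable in its document or explicitly marked as unanswerable. The field names and the unanswerable marker are assumptions made for this sketch; see the dataset card for the actual schema.

# Illustrative grounding check for RepLiQA rows. The field names
# ("document_extracted", "answer") and the unanswerable marker below are
# assumptions; consult the dataset card for the actual schema.

UNANSWERABLE_MARKER = "cannot be answered"  # assumed phrasing of the unanswerable label

def answer_is_grounded(row: dict) -> bool:
    """Return True if the answer appears in the document or is marked unanswerable."""
    answer = row["answer"].strip().lower()
    if UNANSWERABLE_MARKER in answer:
        return True
    return answer in row["document_extracted"].lower()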
We contracted a for-profit data annotation company to create the documents and to prepare question-answer pairs based on the content of each one. Once the data was created, we performed some post-processing: converting PDF files into text, cleaning up the metadata, and splitting the dataset into five balanced splits. Figure 3 shows the full pipeline of the dataset creation process.
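As an illustration of the last step, here is one way documents could be assigned to five roughly balanced splits, for example round-robin within each topic. This is a hypothetical sketch, not the actual procedure used to build RepLiQA.

from collections import defaultdict

# Hypothetical illustration of splitting documents into five balanced splits
# by distributing each topic's documents round-robin. Not the actual RepLiQA
# splitting procedure.

def make_balanced_splits(doc_ids_by_topic: dict, n_splits: int = 5) -> dict:
    """Assign document IDs to n_splits splits, balancing topics across splits."""
    splits = defaultdict(list)
    for topic in sorted(doc_ids_by_topic):
        for i, doc_id in enumerate(sorted(doc_ids_by_topic[topic])):
            splits[i % n_splits].append(doc_id)
    return dict(splits)

# Example: two invented topics, seven documents in total.
example = {"local news": ["d1", "d2", "d3", "d4"], "company policies": ["d5", "d6", "d7"]}
print(make_balanced_splits(example))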
Accessing RepLiQA
RepLiQA can be downloaded from Hugging Face and loaded with only a few lines of code: getting started is as simple as calling datasets.load_dataset("ServiceNow/repliqa").
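A slightly fuller sketch of the loading step is shown below. The column names accessed in the last lines are assumptions based on the dataset's question-answer structure, so check the Hugging Face dataset card if they differ.

# Load RepLiQA with the Hugging Face datasets library and peek at one example.
# The column names accessed below ("question", "answer") are assumptions;
# verify them against the dataset card.

import datasets

repliqa = datasets.load_dataset("ServiceNow/repliqa")  # returns a DatasetDict of all available splits
print(repliqa)  # lists the splits and their columns

first_split = sorted(repliqa.keys())[0]
row = repliqa[first_split][0]
print(row["question"], "->", row["answer"])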
For additional details, including how to access the original PDF files, refer to the tutorial:
https://github.com/ServiceNow/repliqa/blob/main/tutorial.ipynb
For more details on the dataset and information about the experiments we conducted, see our paper: https://arxiv.org/abs/2406.11811.
Find out more about ServiceNow Research.