InsightBench: An enterprise benchmark for multifaceted data analytics
Photo credit: ChatGPT; Authors: Gaurav Sahu, Abhay Puri, Sai Rajeswar, and Issam Hadj Laradji
Highlights
- We introduced InsightBench as a comprehensive, reliable benchmark for business data analytics.
- We introduced AgentPoirot, an end-to-end data analytics agent that significantly outperforms PandasAgent on InsightBench thanks to its ability to conduct multifaceted analytics on a dataset.
- We studied the efficacy of various prompt engineering methods, conducted a thorough dataset collection process, and adopted reliable evaluation metrics to ensure InsightBench sets a solid foundation for future research in this field.
InsightBench is a pioneering benchmark specifically designed for evaluating AI models on their proficiency in conducting end-to-end data analytics. It stands out with the following features:
- Diverse datasets: The benchmark comprises 31 tabular datasets sourced from the ServiceNow platform, covering a wide range of business themes, such as finance, customer service, and operations.
- Thematic variety: It spans five distinct themes, each representing a different facet of business operations, thereby providing a comprehensive testing ground for AI models.
- Automated evaluation: The evaluation process is fully automated, reducing the need for human intervention and facilitating unbiased assessments.
- Real-world relevance: The datasets are representative of real-life business analytics scenarios, ensuring that the insights generated by AI models are applicable and valuable in practical settings.
Real-world data analysis involves complex, multistep processes that demand a deep understanding of the underlying data. However, when we consider the capabilities of large language models (LLMs) in the realm of data analytics, the focus tends to be on solving singular, isolated tasks that are well defined—for instance, writing code to train a linear regression model and compute an R² score on some dataset.
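To make the contrast concrete, a "singular, isolated task" of this kind can be solved in a few lines. The sketch below, using only the standard library, fits a line by ordinary least squares and computes an R² score on toy data (the data and helper names are illustrative):

```python
import random
import statistics

def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def r2_score(ys, preds):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = statistics.fmean(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot

# Toy data: a clear linear trend plus Gaussian noise.
random.seed(0)
xs = list(range(50))
ys = [3.0 * x + 7.0 + random.gauss(0, 2.0) for x in xs]

slope, intercept = fit_line(xs, ys)
r2 = r2_score(ys, [slope * x + intercept for x in xs])
```

A task like this has one well-defined answer; the open-ended, goal-seeking workflow InsightBench targets does not.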
Existing benchmarks for evaluating the analytical capabilities of LLMs concentrate on these simpler, single-query responses. They do not account for the iterative nature of business analytics, where analysts must first identify a goal before exploring the data, continuously refining their queries based on earlier results or shifting focus as new insights emerge. For instance, although an LLM might excel at predicting sales trends from past data, it may struggle to contextualize these predictions within broader business strategies or operational challenges unless explicitly guided to do so.
To bridge this gap, we introduced InsightBench to assess an LLM’s ability to handle end-to-end data analytics workflows. InsightBench includes a series of interconnected datasets that span various business operations—from finance to incident management.
Dataset collection process
InsightBench derives its data from the ServiceNow platform, where we created 31 tabular datasets covering five key themes in business enterprises:
- Incident management
- User management
- Enterprise goal management
- Inventory management
- Finance management
To create a dataset, we selected a relevant table from the ServiceNow system tables and then used its column names to synthesize realistic-looking data using GPT-3. We then planted a trend or irregularity in every dataset. Finally, we conducted a human study that confirmed that the datasets and planted insights are of high quality.
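The "plant a trend" step can be illustrated with a small sketch. This is not the actual pipeline: the column names, category values, and helper functions below are hypothetical stand-ins, and the real process fills an LLM-synthesized table rather than random records. The idea, though, is the same: generate a uniform baseline, then inject a concentrated burst in one category:

```python
import random
from collections import Counter
from datetime import date, timedelta

# Hypothetical category values; the real pipeline derives columns
# from a ServiceNow system table and synthesizes values with an LLM.
CATEGORIES = ["hardware", "software", "network", "database"]

def synthesize_incidents(n_rows=500, seed=0):
    """Generate a uniform baseline of incident records."""
    rng = random.Random(seed)
    start = date(2024, 1, 1)
    return [
        {
            "opened_at": start + timedelta(days=rng.randrange(90)),
            "category": rng.choice(CATEGORIES),
            "priority": rng.choice(["low", "medium", "high"]),
        }
        for _ in range(n_rows)
    ]

def plant_trend(rows, category="hardware", extra=150, seed=1):
    """Plant an irregularity: a burst of reports in one category,
    concentrated in a narrow time window (e.g., a staff shortage)."""
    rng = random.Random(seed)
    burst_start = date(2024, 2, 10)
    rows += [
        {
            "opened_at": burst_start + timedelta(days=rng.randrange(10)),
            "category": category,
            "priority": "high",
        }
        for _ in range(extra)
    ]
    return rows

data = plant_trend(synthesize_incidents())
counts = Counter(r["category"] for r in data)
```

An agent analyzing `data` should surface the hardware burst as its headline insight.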
An example dataset in InsightBench includes hardware incident reports across an organization. The LLM would be expected to understand the domain and report any potentially interesting insights, anomalies, or trends, such as discovering that an incident category has significantly higher report volumes due to a recent shortage in staff. Figure 1 shows some data points in InsightBench.
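Surfacing such an anomaly can be as simple as comparing per-category volumes. The sketch below is an illustrative heuristic, not InsightBench's scoring logic or AgentPoirot's method; the volumes and the z-score threshold are made up for the example:

```python
import statistics

def flag_anomalous_categories(counts, z_threshold=1.5):
    """Flag categories whose report volume sits far above the mean
    volume across categories (a simple z-score heuristic)."""
    values = list(counts.values())
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return []
    return [c for c, v in counts.items() if (v - mean) / std > z_threshold]

# Hypothetical weekly report volumes per incident category.
volumes = {"hardware": 412, "software": 128, "network": 119, "database": 131}
flagged = flag_anomalous_categories(volumes)
```

A capable agent would go one step further and tie the flagged spike back to a plausible cause, such as the staff shortage mentioned above.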
InsightBench tests LLMs on both their technical data processing capabilities and their ability to understand diverse business contexts. Unlike traditional benchmarks, this setup closely mirrors the complexity of real-world data analytics and provides a robust platform for developing and testing the next generation of AI-powered business analytics tools. We used LLaMA Eval to evaluate the insights generated by agents on InsightBench.
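Automated evaluation boils down to matching generated insights against ground-truth insights. The actual benchmark uses an LLM-based evaluator for this; as a crude, fully deterministic illustration of the matching idea only, the sketch below scores each ground-truth insight by its best word-overlap (Jaccard) match among the generated ones:

```python
def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity between two insight strings."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

def score_agent(ground_truth, generated):
    """Match each ground-truth insight to its closest generated
    insight and average the similarities (higher is better)."""
    return sum(
        max(jaccard(gt, g) for g in generated) for gt in ground_truth
    ) / len(ground_truth)

# Hypothetical insights for illustration.
truth = ["hardware incidents spiked in february due to a staff shortage"]
preds = [
    "incident volume is stable across categories",
    "hardware incidents spiked sharply in february",
]
score = score_agent(truth, preds)
```

An LLM-based evaluator plays the same role but judges semantic rather than lexical overlap, which is what makes fully automated assessment of free-form insights feasible.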
AgentPoirot: End-to-end data analytics agent
We also developed AgentPoirot, an LLM-powered autonomous business analytics agent that can skillfully navigate InsightBench’s complex datascape. Not only can AgentPoirot ask complex questions for any given dataset, but it can also automatically find answers to those questions, making it a truly end-to-end business analytics agent. (See Figures 2 and 3.)
Figure 2: AgentPoirot vs. PandasAgent on InsightBench
We can draw two conclusions based on the above graphs:
- AgentPoirot significantly outperforms PandasAgent (PA) on InsightBench, which can be attributed to AgentPoirot’s ability to conduct multifaceted analytics on a dataset.
- While using GPT-4o leads to the best performance overall, the open-source AgentPoirot variant using LLaMA-3-70b outperforms GPT-3.5-turbo in most cases and is on par with GPT-4-turbo in some cases.
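The multifaceted loop behind these results (propose questions about a dataset, answer each one with analysis, collect the answers as insights) can be sketched in miniature. This is an illustrative toy, not the actual AgentPoirot implementation: a real agent uses an LLM to generate both the questions and the analysis code, whereas here `propose_questions` and `answer_question` are hard-coded stand-ins so the sketch runs offline:

```python
from collections import Counter

def propose_questions(columns):
    """Stand-in for LLM-driven question generation."""
    return [f"Which value of '{c}' appears most often?" for c in columns]

def answer_question(rows, column):
    """Stand-in for LLM-generated analysis code."""
    top, count = Counter(r[column] for r in rows).most_common(1)[0]
    return f"'{top}' ({count} rows)"

def run_agent(rows):
    """End-to-end loop: ask questions, answer them, collect insights."""
    columns = list(rows[0])
    insights = []
    for col, question in zip(columns, propose_questions(columns)):
        insights.append((question, answer_question(rows, col)))
    return insights

# A tiny hypothetical incident table.
rows = [
    {"category": "hardware", "priority": "high"},
    {"category": "hardware", "priority": "low"},
    {"category": "software", "priority": "high"},
]
insights = run_agent(rows)
```

The end-to-end character, where question generation and question answering live in one loop, is what distinguishes this style of agent from single-query tools.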
Our paper, InsightBench: Evaluating Business Analytics Agents Through Multi-Step Insight Generation, and the project page include a detailed analysis of AgentPoirot and our evaluation procedure. Overall, AgentPoirot proves more effective than other agents in a real-world setting, marking a significant step forward in applying LLMs to business analytics.
Getting started with AgentPoirot
Want to try AgentPoirot on your own dataset? Visit our AgentPoirot GitHub page.
Call for collaborators
We invite the research community to interact with our benchmark. If you want to contribute by adding more datasets to the benchmark or by improving the agent, you can do so by opening an issue on our GitHub page.