BigDocs: An open multimodal dataset for document understanding


Despite technological advances, documents continue to be a vital part of daily life, often combining text with visual elements such as charts, tables, and diagrams to provide context and meaning. To enable AI systems to mirror humanlike understanding of such documents, models must move beyond basic optical character recognition (OCR) to comprehend layouts, diagrams, and even hand-drawn sketches.

Progress in this area, however, is hindered by a lack of accessible, high-quality training data, much of which is proprietary, restrictively licensed, or fragmented. This is where BigDocs comes in.

What is BigDocs?

BigDocs is a comprehensive solution for advanced document understanding. It consists of two key components:

  1. BigDocs-7.5M, a large-scale, license-permissive training dataset
  2. BigDocs-Bench, a benchmark for evaluating advanced document understanding capabilities

Let’s explore how BigDocs and BigDocs-Bench meet the rising demand for sophisticated document understanding technologies.

Limitations of current AI models and datasets

Current vision language models (VLMs) excel at many tasks but struggle with complex ones, such as generating long, structured outputs or interpreting user interfaces. In addition, open-source VLMs, which rely on publicly available datasets, significantly underperform closed-source VLMs on such complex tasks. Bridging this gap involves overcoming several obstacles:

  1. Scarcity of open datasets: Many datasets for training VLMs aren’t publicly available. When they are, details about their content are often missing. This lack of transparency slows progress in the field.
  2. Simple tasks in open datasets: Publicly available datasets often address only basic tasks, such as OCR or question answering. Such data isn’t enough to develop models capable of handling complex, real-world challenges.
  3. Restrictive licensing: Many datasets have unclear or restrictive licenses, making them difficult to use for business purposes and stifling innovation.
  4. Limited benchmarks: Existing benchmarks overlook critical challenges, such as tasks involving complex visual documents or generating structured outputs, such as HTML and JSON—tasks essential in real-world applications.

BigDocs-7.5M: Visual document understanding

BigDocs-7.5M is a large-scale, license-permissive dataset carefully curated to advance visual document understanding across diverse document types and tasks. Designed to enable robust model training, it supports complex document analysis, reasoning, and manipulation.

As part of the curation process, we created detailed data sheets for each source dataset, capturing key metadata such as ownership, licensing, and references. These data sheets enhance transparency, reproducibility, and traceability.
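To make the idea concrete, here is a minimal sketch of the kind of metadata record such a data sheet might hold. The field names and the license allow-list are illustrative assumptions, not the project's actual schema:

```python
from dataclasses import dataclass, field, asdict

# Hypothetical data-sheet record; fields mirror the metadata the post
# mentions (ownership, licensing, references) but are not BigDocs' schema.
@dataclass
class DataSheet:
    name: str
    owner: str
    license: str                      # e.g., an SPDX identifier
    source_url: str
    references: list = field(default_factory=list)

    def is_permissive(self) -> bool:
        # A small, illustrative allow-list of business-friendly licenses.
        return self.license in {"CC-BY-4.0", "Apache-2.0", "MIT", "CC0-1.0"}

sheet = DataSheet(
    name="charts-subset",
    owner="Example Org",
    license="CC-BY-4.0",
    source_url="https://example.org/charts",
    references=["original release notes"],
)
print(asdict(sheet)["license"])  # CC-BY-4.0
print(sheet.is_permissive())     # True
```

Keeping license information machine-readable like this is what makes it possible to filter an entire corpus down to permissively licensed sources automatically.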

Additionally, we developed the BigDocs Toolkit—a modular suite designed for preprocessing, filtering, contamination control, metadata management, and dataset loading (see Figure 1).


Figure 1: The BigDocs toolkit streamlines the integration of large-scale document datasets while maintaining high quality.
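Two of the toolkit steps named above, license filtering and contamination control, can be sketched in a few lines. This is an illustrative simplification (exact-duplicate hashing), not the toolkit's actual implementation:

```python
import hashlib

# Illustrative allow-list; real pipelines would read this from data sheets.
PERMISSIVE = {"CC-BY-4.0", "Apache-2.0", "MIT"}

def fingerprint(text: str) -> str:
    """Stable hash used to spot exact duplicates across datasets."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def curate(samples, benchmark_texts):
    """Keep permissively licensed, non-duplicate samples that do not
    appear in the evaluation set (to avoid train/test contamination)."""
    bench = {fingerprint(t) for t in benchmark_texts}
    seen, kept = set(), []
    for s in samples:
        if s["license"] not in PERMISSIVE:
            continue  # license filtering
        fp = fingerprint(s["text"])
        if fp in seen or fp in bench:
            continue  # deduplication and contamination control
        seen.add(fp)
        kept.append(s)
    return kept

samples = [
    {"text": "invoice 001", "license": "MIT"},
    {"text": "invoice 001", "license": "MIT"},         # duplicate
    {"text": "secret form", "license": "proprietary"}, # filtered out
    {"text": "test chart", "license": "MIT"},          # leaked benchmark item
]
print(len(curate(samples, ["test chart"])))  # 1
```

Production systems typically add near-duplicate detection (e.g., MinHash) on top of exact hashing, since trivially perturbed copies would slip past a SHA-256 fingerprint.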

BigDocs-7.5M spans 30 tasks, grouped into three main categories:

  1. Document information extraction encompasses enhanced OCR for diverse document types, layout analysis, and table detection.
  2. Document understanding focuses on semantic comprehension tasks, such as document classification, question answering, and diagram analysis.
  3. Document creation and manipulation involves transforming visual data into structured formats such as HTML, LaTeX, Markdown, and JSON.
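The third category, turning visual content into structured formats, can be illustrated with a toy example. A trained VLM would handle the hard part (extracting cells from pixels); the snippet below only shows the target-format side, serializing extracted table cells into HTML and JSON:

```python
import json

def table_to_html(header, rows):
    """Render extracted table cells as a minimal HTML table."""
    head = "".join(f"<th>{h}</th>" for h in header)
    body = "".join(
        "<tr>" + "".join(f"<td>{c}</td>" for c in row) + "</tr>"
        for row in rows
    )
    return f"<table><tr>{head}</tr>{body}</table>"

def table_to_json(header, rows):
    """Render the same cells as a JSON array of row objects."""
    return json.dumps([dict(zip(header, row)) for row in rows])

header = ["item", "qty"]
rows = [["pen", "3"], ["pad", "2"]]
print(table_to_html(header, rows))
print(table_to_json(header, rows))  # [{"item": "pen", "qty": "3"}, ...]
```

Outputs like these are long and rigidly structured, which is precisely why they are hard for models trained only on OCR-style data.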

BigDocs-Bench: Redefining benchmarking

Since BigDocs-7.5M is designed to train models for complex document understanding tasks, existing benchmarks fall short. They primarily target simple tasks and cannot adequately measure the complex task performance that BigDocs-7.5M enables.

BigDocs-Bench, designed to evaluate advanced VLM capabilities, addresses this limitation. Featuring 10 downstream tasks (see Figure 2), BigDocs-Bench tests models on their ability to process visually rich documents and produce long, structured outputs. These tasks span diverse challenges in document understanding, from fine-grained analysis to creative generation.

Figure 2: A selection of downstream tasks, showcasing the diverse and complex real-world applications supported by BigDocs-Bench
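Scoring long, structured outputs (HTML, JSON, LaTeX) requires softer metrics than exact match, since a prediction can be mostly correct while differing in a few tokens. The sketch below uses a generic string-similarity ratio as a stand-in; it is not BigDocs-Bench's actual metric:

```python
from difflib import SequenceMatcher

def similarity(prediction: str, reference: str) -> float:
    """Fraction of matching content between two strings, in [0, 1]."""
    return SequenceMatcher(None, prediction, reference).ratio()

ref  = "<table><tr><td>1</td></tr></table>"
good = "<table><tr><td>1</td></tr></table>"
bad  = "<div>1</div>"
print(round(similarity(good, ref), 2))  # 1.0
print(similarity(bad, ref) < 1.0)       # True
```

Real benchmarks often use task-specific measures instead, such as tree edit distance for HTML or key-level matching for JSON, because raw string similarity can over-reward outputs with the right tokens in the wrong structure.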

Practical implications and benefits

BigDocs-7.5M delivers transformative benefits for multimodal document understanding, particularly in enterprise settings, where transparency, customizability, and efficiency are critical. One key application is state-of-the-art fine-tuning of any pretrained multimodal model for enhanced performance (see Figure 3).


Figure 3: State-of-the-art fine-tuning of a pretrained multimodal model for enhanced performance

Our evaluations show that training open-source models with BigDocs-7.5M can lead to significant performance gains—improving accuracy by up to 34.5%. Remarkably, these fine-tuned models surpass even proprietary leaders such as GPT-4, achieving a 25.8% higher score on BigDocs-Bench.

Figure 4: A performance comparison of well-known models, highlighting the impact of BigDocs-7.5M in enabling competitive and customizable AI solutions

Real-world applications

BigDocs-7.5M empowers organizations to transform massive amounts of unstructured data, such as dashboards and workflows, into structured formats that integrate seamlessly into downstream pipelines. It also enables the automation of repetitive tasks, including understanding and interacting with GUIs and web elements, driving efficiency and scalability in real-world applications.

With BigDocs-7.5M, we aim to set a new standard for transparency and responsibility in developing multimodal document understanding benchmarks. We’re committed to promoting open science and driving progress within the AI community. We also advocate for future researchers to embrace these responsible practices.

Learn more in our BigDocs paper and on our project website. And check out our BigDocs dataset.

Find out more about our latest research.