BigCode collaboration introduces The Stack

The Stack, 3TB of permissive code data, with love from BigCode

Following the announcement of BigCode on Sept. 26, 2022, by ServiceNow Research and Hugging Face, researchers from the project have released The Stack, a 3TB dataset of permissively licensed source code, to the research community. A supporting research paper that includes data analysis, experiments, and a discussion of results and limitations details how data was collected, deduplicated, and filtered for permissive source code licenses.

With the release of The Stack, BigCode aims to provide more transparency on the development of large language models for code (code LLMs), unlike other research groups that have released code LLMs but have not released their training data. While The Stack was created from public, permissively licensed source code data, the BigCode collaboration is experimenting with ways to provide developers with the option to request that their code be removed from the dataset.

As outlined in the BigCode collaboration announcement, code LLMs are trained on large collections of source code data to enable the synthesis of programs from both natural language descriptions and code snippets. These code LLMs have the potential to assist professional developers with programming tasks, for example, by autocompleting code snippets, generating docstrings for a given function signature and body, or suggesting unit tests for a codebase.

Imagine a future in which mission-critical legacy software becomes easier to maintain, as professional developers are guided by code-generating applications on how to write robust code in unfamiliar programming languages, and citizen developers can call on the technology to help develop modern applications in low-code/no-code environments.

Preliminary experiments to assess dataset quality indicate that near-deduplication is an important preprocessing step for achieving competitive results on Text2Code benchmarks. Details of the experiments can be found in the paper.

Dataset collection

Dataset collection took several months to complete. To create The Stack, 220.92 million unique GitHub source code repository names were collected from GH Archive, with 51.76 billion files (and 5.28 billion unique files) successfully downloaded from 137.36 million public and accessible repositories. The uncompressed size of all stored files is 92.36TB.

The Stack dataset collection pipeline: GH Archive → query → 220M repo names → git clone → raw dataset → file-extension selection → 28TB of data → license filtering → 3TB of data → near-deduplication → 1.5TB of data

Comparison with other code datasets

In comparing The Stack with CodeParrot, AlphaCode, CodeGen, and PolyCoder, researchers noted that AlphaCode and CodeGen did not release their data, but they did provide statistics on the amount of data per programming language.

The Stack and CodeParrot provide source code for 30 programming languages, while AlphaCode, PolyCoder, and CodeGen provide source code data for only 12, 12, and six languages, respectively. The Stack is more than three times the size of CodeParrot, the next-largest publicly available code dataset, and is also bigger than CodeParrot for each individual programming language.

The Stack is more than three times the size of CodeParrot, the next-largest released code dataset.

Toward a permissive license dataset

The goal of The Stack is to enable researchers from academia and industry to collaborate on the research and development of large language models for code applications by releasing a dataset that can be safely used for pre-training and deployed into applications.

To this end, The Stack’s dataset was filtered to include only code under permissive licenses, i.e., those with minimal restrictions on how the software can be copied, modified, and redistributed (e.g., MIT and Apache 2.0). Code under copyleft licenses such as GPL is not included, because these licenses require that the same rights be preserved in derivative works. Some have argued that a model trained on copyleft-licensed code should be considered a derivative work.

It was brought to our attention that licenses such as MPL, LGPL, and EPL were erroneously labeled as permissive and were included in the dataset when they are, in fact, weak copyleft licenses. The dataset working group will remove these weak copyleft license files from The Stack and will consider additional license types to include in the dataset. The weak copyleft-licensed data that will be removed is only a small part of the overall dataset. Hence, we expect the experimental findings to remain unchanged.
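The distinction between permissive, weak copyleft, and copyleft licenses can be pictured as a lookup table. The following is a minimal sketch, not the BigCode tooling: the SPDX-style identifiers, the deliberately incomplete mapping, and the function names are all assumptions chosen for illustration.

```python
from enum import Enum


class LicenseClass(Enum):
    PERMISSIVE = "permissive"
    WEAK_COPYLEFT = "weak copyleft"
    COPYLEFT = "copyleft"
    UNKNOWN = "unknown"


# Illustrative, incomplete mapping keyed by SPDX identifier (an assumption;
# the article does not say how licenses are keyed in the real pipeline).
LICENSE_CLASSES = {
    "MIT": LicenseClass.PERMISSIVE,
    "Apache-2.0": LicenseClass.PERMISSIVE,
    "MPL-2.0": LicenseClass.WEAK_COPYLEFT,
    "LGPL-3.0-only": LicenseClass.WEAK_COPYLEFT,
    "EPL-2.0": LicenseClass.WEAK_COPYLEFT,
    "GPL-3.0-only": LicenseClass.COPYLEFT,
}


def classify_license(spdx_id):
    """Return the license class, or UNKNOWN if no license was detected or listed."""
    if spdx_id is None:
        return LicenseClass.UNKNOWN
    return LICENSE_CLASSES.get(spdx_id, LicenseClass.UNKNOWN)


def is_permissive(spdx_id):
    """True only for licenses explicitly listed as permissive."""
    return classify_license(spdx_id) is LicenseClass.PERMISSIVE
```

Treating anything not explicitly listed as unknown, rather than as permissive, is the conservative choice: a labeling mistake then excludes data instead of including it.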

One of the challenges encountered in compiling the permissive-license dataset is that each file may be part of multiple repositories. In this instance, files are included in the dataset if at least one of the repositories containing the file has a permissive license. Empty files, files larger than 1MB, and files that could not be decoded are removed from the dataset.
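The file-level rules described above can be sketched as a single predicate. This is a hedged illustration rather than the project's code: the function and constant names are hypothetical, UTF-8 is assumed as the decoding target, and 1MB is assumed to mean 1,000,000 bytes.

```python
MAX_FILE_BYTES = 1_000_000  # "1MB" in the article; the exact byte count is an assumption
PERMISSIVE = {"MIT", "Apache-2.0"}  # illustrative subset, not the full list


def can_decode(raw: bytes) -> bool:
    """Return True if the bytes decode as UTF-8 (assumed encoding)."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def keep_file(raw: bytes, repo_licenses, permissive=PERMISSIVE) -> bool:
    """Decide whether a file stays in the permissive-license dataset.

    repo_licenses holds the detected license (or None) of every
    repository that contains this file.
    """
    if len(raw) == 0:              # empty files are removed
        return False
    if len(raw) > MAX_FILE_BYTES:  # files larger than 1MB are removed
        return False
    if not can_decode(raw):        # undecodable files are removed
        return False
    # Keep the file if at least one containing repository is permissive.
    return any(lic in permissive for lic in repo_licenses)
```

For example, `keep_file(b"print(1)\n", ["GPL-3.0-only", "MIT"])` keeps the file because one of its two repositories carries a permissive license.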

We found that the license detector found no license for approximately 81% of the repositories; these repositories were excluded from the dataset.

Near-deduplication implemented on top of exact deduplication

When building software, snippets of code are often copied or configuration files reused with slightly altered settings. This leads to exact duplicates, as well as near duplicates, where two files are the same except for a few changes. Since training on duplicated data has a negative impact on model performance, these duplicates are removed from the dataset.
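To make the two kinds of duplicates concrete, here is a small sketch that removes exact duplicates by content hash and then near-duplicates by Jaccard similarity over token shingles. It is an illustration only: this article does not describe the algorithm, the 0.7 threshold and shingle size are arbitrary assumptions, and the pairwise comparison shown here would not scale to billions of files, where approximate methods such as MinHash with locality-sensitive hashing are commonly used instead.

```python
import hashlib
import re


def exact_dedup(files):
    """Keep the first (path, text) pair seen for each SHA-256 content hash."""
    seen, kept = set(), []
    for path, text in files:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append((path, text))
    return kept


def shingles(text, k=3):
    """Set of k-token windows; k=3 is an arbitrary illustrative choice."""
    tokens = re.findall(r"\w+", text)
    if len(tokens) < k:
        return {tuple(tokens)}
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_dedup(files, threshold=0.7):
    """Greedy O(n^2) pass: drop a file if it is too similar to one already kept."""
    kept = []  # (path, text, shingle set)
    for path, text in files:
        current = shingles(text)
        if all(jaccard(current, other) < threshold for _, _, other in kept):
            kept.append((path, text, current))
    return [(path, text) for path, text, _ in kept]


# Hypothetical sample: a config file, an exact copy, a slightly altered copy,
# and an unrelated source file.
config = "host = localhost\nport = 8080\ndebug = true\ntimeout = 30\nretries = 3\n"
altered = config.replace("retries = 3", "retries = 5")
sample = [
    ("a.cfg", config),
    ("copy.cfg", config),
    ("b.cfg", altered),
    ("add.py", "def add(a, b):\n    return a + b\n"),
]

unique = exact_dedup(sample)  # drops copy.cfg
distinct = near_dedup(unique)  # additionally drops b.cfg
```

Running exact deduplication first is cheap and shrinks the input to the more expensive similarity comparison.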

In the permissive-license dataset, 38.6% of files were found to be near-duplicates of other files and were removed; this represents 53.7% of the volume of the dataset. While the all-license dataset contains more than 29TB of data, selecting only permissively licensed files reduces the dataset to 3.1TB. Near-deduplication of the permissive-license dataset further reduces it by about half.
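A quick back-of-the-envelope check of these figures, using only the numbers quoted above (the variable names are ours), shows where "about half" comes from and what the gap between the file share and the volume share implies:

```python
permissive_tb = 3.1              # size after license filtering
removed_file_fraction = 0.386    # share of files removed as near-duplicates
removed_volume_fraction = 0.537  # share of data volume removed

# Data volume left after near-deduplication.
remaining_tb = permissive_tb * (1 - removed_volume_fraction)

# Average size of a removed file relative to a kept file.
removed_avg = removed_volume_fraction / removed_file_fraction
kept_avg = (1 - removed_volume_fraction) / (1 - removed_file_fraction)
size_ratio = removed_avg / kept_avg

print(f"remaining: {remaining_tb:.2f} TB")
print(f"removed files are about {size_ratio:.1f}x larger on average")
```

Because the removed files account for a larger share of volume than of file count, the near-duplicates were on average larger than the files that were kept.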

For the permissive-license dataset, the four biggest languages are HTML (746GB), JavaScript (486GB), Java (271GB), and C (222GB). Together, these make up more than 55% of the total dataset size.

Histogram of the amount of data per programming language for the permissive license dataset

Source code licenses

Source code license information was available for only 26.4 million repositories. The Go License Detector was run on the remaining 110.9 million repositories. Only source code with permissive licenses was targeted for inclusion in the dataset. MIT and Apache 2.0 are the most frequently detected license types from the targeted repositories.

Legal, ethics, and governance

The legal, ethics, and governance working group continues to explore topics such as licensing (including copyleft and the intended use of permissively licensed code), attribution of generated code to original code, rights to restrict processing, the inclusion of personally identifiable information (PII), and risks of malicious code. The working group is discussing these challenges with a view to implementing potential solutions at scale.

Ongoing work

Ongoing work is focused on datasets, model evaluation, model inference, tokenization for code, model training, and more. More than 300 collaborators joined the project in the first month. For full details about the dataset, dataset analysis, related experiments, and future work, please review the paper and follow the BigCode project. Researchers who wish to contribute to the project can apply at https://www.bigcode-project.org/docs/about/join/.

Acknowledgements

This article draws content from the research paper. Thanks to the authors, Denis Kocetkov (ServiceNow Research), Raymond Li (ServiceNow Research), Loubna Ben Allal (Hugging Face), Jia Li (independent researcher), Chenghao Mou (independent researcher), Carlos Muñoz Ferrandis (Hugging Face), Sean Hughes (ServiceNow Research), Thomas Wolf (Hugging Face), Dzmitry Bahdanau (ServiceNow Research), Leandro von Werra (Hugging Face), and Harm de Vries (ServiceNow Research).

Special thanks to Sean Hughes, Harm de Vries, and Leandro von Werra for their collaboration in writing this article.