Announcing StarCoder2 and The Stack v2, a BigCode collaboration
We’re excited to introduce StarCoder2, a family of open-access large language models (LLMs) developed through open scientific collaboration with the BigCode community under the stewardship of ServiceNow Research and Hugging Face—and with model training by ServiceNow, Hugging Face, and NVIDIA.
Open-access Code LLMs, such as StarCoder2, should be thought of as foundation models. Machine learning (ML) engineers can fine-tune them and create more powerful specialized LLMs to perform tasks such as code generation, code completion, and natural language text summarization—and even power conversational technical chatbot assistants. Of course, the models can be used as is to perform some of these tasks, but without fine-tuning, results will vary.
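As a rough illustration of the "use as is" path, the sketch below prompts a StarCoder2 checkpoint for code completion with the Hugging Face transformers library. The checkpoint ID and generation settings are assumptions for illustration, not a prescribed setup; check the BigCode organization on the Hugging Face Hub for the exact model names.

```python
# Minimal sketch: code completion with a StarCoder2 checkpoint via
# Hugging Face transformers. The checkpoint ID below is an assumption;
# consult the BigCode Hub organization for the published model names.
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigcode/starcoder2-3b"  # assumed model ID
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

prompt = "def fibonacci(n):"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```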
Accessible source code
To build StarCoder2, we curated a new dataset, The Stack v2, the largest open pretraining corpus for code LLMs, drawing on new data sources including the Software Heritage archive and Kaggle.
The Software Heritage archive is the largest public collection of source code in existence. We couldn’t have trained StarCoder2 without all the developers and scientists who contribute to the great library of source code.
As shared by Software Heritage, StarCoder2 and The Stack v2 represent the first milestone in making “the vast body of knowledge embedded in humankind’s source code more accessible and reusable for a much broader community.”
The Stack v2 yields a raw dataset of 67.5TB, roughly 10 times the 6.4TB of the first version, and approximately 900 billion training tokens, up from about 200 billion. That means the final training set is more than four times larger than the one used for the first StarCoder project, with data from more trusted sources.
Careful data preparation included deduplication; the scanning and removal of malicious code; decontamination; the removal of code from developers who opted out of having their data used for training; and the scanning and removal of personally identifiable information (PII).
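To make two of these steps concrete, here is a minimal conceptual sketch of exact deduplication by content hash and a naive email-address scrub. The real BigCode pipeline uses more sophisticated near-deduplication and ML-based PII detection; nothing here reproduces that pipeline, and the function name and regex are illustrative assumptions.

```python
# Conceptual sketch only: exact deduplication via content hashing and a
# naive redaction of email-like strings. Not the actual BigCode pipeline.
import hashlib
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def prepare(files: list[str]) -> list[str]:
    seen = set()
    cleaned = []
    for text in files:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:  # drop exact duplicates
            continue
        seen.add(digest)
        cleaned.append(EMAIL_RE.sub("<EMAIL>", text))  # redact email-like strings
    return cleaned

print(prepare(["print('hi')  # a@b.com", "print('hi')  # a@b.com", "x = 1"]))
```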
Openness at the core
LLMs can be developed with different levels of openness. The BigCode project, which started in September 2022, is focused on open and responsible scientific collaboration. It brings together more than 1,100 members, including academics, industry practitioners, startup teams, and independent researchers, in a completely transparent collaboration. This spans data sourcing, research, experiments, training, evaluation, governance, and more.
In support of BigCode's mission of responsible development of LLMs for code, outputs are either open sourced or made available under open-access licenses, depending on the intended use cases.
The model weights are available under an OpenRAIL license, and training data is traceable using SoftWare Heritage persistent IDentifiers (SWHIDs). The model’s supporting code will reside on the BigCode project’s GitHub page.
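For readers unfamiliar with SWHIDs, the core identifier has the form swh:1:<object-type>:<40-character SHA-1 hex>, where the object type is one of cnt, dir, rev, rel, or snp. The short sketch below checks that shape; the regex and the all-zeros identifier are illustrative assumptions, and the full SWHID specification also allows qualifiers that this sketch ignores.

```python
# Minimal sketch: validating the core shape of a SoftWare Heritage
# persistent IDentifier (SWHID), the identifiers used to trace
# The Stack v2's sources. Qualified SWHIDs are not handled here.
import re

SWHID_RE = re.compile(r"^swh:1:(cnt|dir|rev|rel|snp):[0-9a-f]{40}$")

def is_valid_swhid(swhid: str) -> bool:
    return bool(SWHID_RE.fullmatch(swhid))

# Hypothetical identifier used only to show the format.
print(is_valid_swhid("swh:1:dir:" + "0" * 40))  # True
```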
Improved performance
The StarCoder2 family comes in three sizes: 3 billion, 7 billion, and 15 billion parameters. The 3 billion-parameter StarCoder2 model matches the performance of the first-generation StarCoder 15 billion-parameter model.
In our recent evaluations, StarCoder2 outperformed other code LLMs of similar size on most benchmarks and, in some cases, matched or beat models more than twice its size. The 15 billion-parameter model:
- Outperformed models of comparable size, such as Code Llama 13B, and matched Code Llama 34B
- Matched or outperformed DeepSeekCoder-33B on low-resource programming languages, as well as math and code reasoning
- Underperformed DeepSeekCoder-33B on high-resource programming languages
By ensuring transparency regarding our training data and the model weights, we seek to increase trust in the model and inspire other researchers to improve upon it.
We encourage you to read the paper StarCoder2 and The Stack v2: The Next Generation for more details about how the models and dataset were created and evaluated.
Find out more about ServiceNow Research.