Announcing BigCode for the responsible development of large language models

BigCode by ServiceNow Research and Hugging Face

We’re excited to announce the BigCode project, led by ServiceNow Research and Hugging Face. In the spirit of the BigScience initiative,¹ we aim to develop state-of-the-art large language models (LLMs) for code in an open and responsible way.

Code LLMs enable the completion and synthesis of code from both other code and natural language descriptions, and they work across a wide range of domains, tasks, and programming languages. These models can assist professional and citizen developers in building new applications.

BigCode invites AI researchers to collaborate on the following topics:

A representative evaluation suite for code LLMs covering a diverse set of tasks and programming languages
Responsible development and governance of data sets for code LLMs
Faster training and inference methods for LLMs

The first goal of BigCode is to develop and release a data set large enough to train a state-of-the-art language model for code. We’ll ensure that only files from repositories with permissive licenses go into the data set.
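The license filter described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the allowlist of license identifiers, the `SourceFile` record, and the helper names are hypothetical, and the post does not say which licenses BigCode will treat as permissive or how licenses will be detected.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional

# Hypothetical allowlist of SPDX-style identifiers (lowercased).
# The actual list used by BigCode is not given in this post.
PERMISSIVE_LICENSES = frozenset(
    {"mit", "apache-2.0", "bsd-2-clause", "bsd-3-clause", "isc"}
)


@dataclass(frozen=True)
class SourceFile:
    """A candidate file together with the license detected for its repository."""
    repo: str
    path: str
    repo_license: Optional[str]  # None when no license could be detected


def is_permissive(license_id: Optional[str]) -> bool:
    """Return True only for licenses on the allowlist; unknown means excluded."""
    if license_id is None:
        return False
    return license_id.strip().lower() in PERMISSIVE_LICENSES


def filter_permissive(files: Iterable[SourceFile]) -> List[SourceFile]:
    """Keep only files whose repository carries a permissive license."""
    return [f for f in files if is_permissive(f.repo_license)]


if __name__ == "__main__":
    candidates = [
        SourceFile("alice/tool", "main.py", "MIT"),
        SourceFile("bob/lib", "lib.c", "GPL-3.0"),
        SourceFile("carol/app", "app.js", None),
    ]
    for f in filter_permissive(candidates):
        print(f.repo, f.path)
```

Excluding files with an undetected license is a conservative design choice in this sketch, not something the post specifies.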

With that data set, we’ll train a 15-billion-parameter language model for code on ServiceNow’s in-house GPU cluster, using an adapted version of Megatron-LM to distribute training across the cluster.
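A rough back-of-the-envelope calculation shows why a model of this size calls for distributed training. The figures below are assumptions, not details from this post: 2 bytes per parameter for half-precision weights, and about 16 bytes per parameter for mixed-precision training with the Adam optimizer (half-precision weights and gradients plus single-precision master weights and two optimizer moments). Activation memory is ignored.

```python
NUM_PARAMS = 15_000_000_000  # 15 billion parameters, as stated in the post

# Assumed per-parameter costs (not from the post):
BYTES_FP16_WEIGHTS = 2        # half-precision weights only
BYTES_MIXED_ADAM = 16         # fp16 weights + fp16 grads + fp32 weights + 2 Adam moments


def memory_gb(num_params: int, bytes_per_param: int) -> float:
    """Memory in gigabytes (10**9 bytes) for a given per-parameter cost."""
    return num_params * bytes_per_param / 1e9


weights_gb = memory_gb(NUM_PARAMS, BYTES_FP16_WEIGHTS)
training_state_gb = memory_gb(NUM_PARAMS, BYTES_MIXED_ADAM)

if __name__ == "__main__":
    print(f"weights only:   {weights_gb:.0f} GB")
    print(f"training state: {training_state_gb:.0f} GB")
```

Under these assumptions the weights alone take about 30 GB and the full training state about 240 GB, which is more than a single accelerator typically holds, so the model and optimizer state have to be split across devices.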

Once the model is trained, we’ll evaluate its capabilities. While there are numerous benchmarks available for natural language processing (NLP), the landscape of benchmarks suited for code is much sparser. We’ll strive to make evaluation easier and broader so that we can learn more about the model’s capabilities.

Academic research often stops after evaluation, but that is where the work on practical applications begins. Inference speed is crucial for applications such as autocompletion, so we’re interested in making architectural changes and devising post-training optimization tools that speed up inference.

We’ll follow, as well as establish, responsible AI practices for training and sharing LLMs, and we’ll uphold the principles of openness and transparency throughout the development process. Experiments can be expensive and take a long time to run, so we’ll share the scientific plan with participants and solicit feedback before we execute it.

AI practitioners from diverse backgrounds are invited to join the BigCode project. The invitation is open to those who have a professional AI research background and can commit time to the project.

In general, we expect applicants to be affiliated with a research organization (either in academia or industry) and work on the technical/ethical/legal aspects of LLMs for coding applications.

Learn more about the project on the official website, and join the conversation on Twitter @BigCodeProject.

¹ The BigScience initiative is a scientific collaboration that culminated in July 2022 with the release of BLOOM, the world’s largest open multilingual language model.