Code Generation

This document serves as an overview of the different mechanisms and areas of governance in the BigCode project. It aims to support …

Sean Hughes, Harm de Vries, Jennifer Robinson, Carlos Muñoz Ferrandis, Loubna Ben Allal, Leandro von Werra, Jennifer Ding, Sébastien Paquet, Yacine Jernite

ArXiv, 2024.

RepoFusion: Training Code Models to Understand Your Repository

Despite the huge success of Large Language Models (LLMs) in coding assistants like GitHub Copilot, these models struggle to understand …

Disha Shrivastava, Denis Kocetkov, Harm de Vries, Dzmitry Bahdanau, Torsten Scholak

ArXiv, 2023.

StarCoder: may the source be with you!

The BigCode community, an open-scientific collaboration working on the responsible development of Large Language Models for Code (Code …

Raymond Li, Loubna Ben Allal, Yangtian Zi, Denis Kocetkov, Chenghao Mou, Christopher Akiki, Jia Li, Jenny Chim, Terry Yue Zhuo, Thomas Wang, Mishig Davaadorj, João Monteiro, Oleh Shliazhko, Nicolas Gontier, Nicholas Meade, Ming-Ho Yee, Logesh Kumar Umapathi, Benjamin Lipkin, Zhiruo Wang, Rudra Murthy, Jason Stillerman, Siva Sankalp Patel, Dmitry Abulkhanov, Marco Zocca, Zhihan Zhang, Nour Fahmy, Urvashi Bhattacharyya, Swayam Singh, Sasha Luccioni, Paulo Villegas, Maxim Kunakov, Fedor Zhdanov, Manuel Romero, Tony Lee, Nadav Timor, Jennifer Ding, Claire Schlesinger, Hailey Schoelkopf, Jan Ebert, Jennifer Robinson, Carolyn Jane Anderson, Brendan Dolan-Gavitt, Danish Contractor, Siva Reddy, Daniel Fried, Dzmitry Bahdanau, Yacine Jernite, Carlos Muñoz Ferrandis, Sean Hughes, Thomas Wolf, Arjun Guha, Leandro von Werra, Harm de Vries, Joel Lamy Poirier, Alex Gu, Armel Zebaze, Jian Zhu, Manan Dey, Marc Marone, Mayank Mishra, Muhtasham Oblokulov, Olivier Dehaene, Qian Liu, Tri Dao, Wenhao Yu, Niklas Muennighoff

Transactions on Machine Learning Research (TMLR), 2023.

Multilingual Code Retrieval Without Paired Data: A New Benchmark and Experiments

We seek to overcome limitations to code retrieval quality posed by the scarcity of data containing pairs of code snippets and natural …

João Monteiro, Torsten Scholak, Virendra Mehta, David Vazquez, Christopher Pal

Workshop at the International Conference on Learning Representations (ICLR), 2023.

SantaCoder: don't reach for the stars!

The BigCode project is an open-scientific collaboration working on the responsible development of large language models for code. This …

Harm de Vries, Raymond Li, Joel Lamy Poirier, Dzmitry Bahdanau, Denis Kocetkov, Sean Hughes

Workshop at the International Conference on Learning Representations (ICLR), 2023.

The Stack: 3 TB of permissively licensed source code

Large Language Models (LLMs) play an ever-increasing role in the field of Artificial Intelligence (AI), not only for natural …

Denis Kocetkov, Raymond Li, Loubna Ben Allal, Jia Li, Chenghao Mou, Carlos Muñoz Ferrandis, Yacine Jernite, Margaret Mitchell, Sean Hughes, Thomas Wolf, Dzmitry Bahdanau, Leandro von Werra, Harm de Vries

Transactions on Machine Learning Research (TMLR), 2022.

Towards Neural Functional Program Evaluation

This paper explores the capabilities of current transformer-based language models for program evaluation of simple functional …

Torsten Scholak, Jonathan Pilault, Joey Velez-Ginorio

Conference on Neural Information Processing Systems (NeurIPS), 2021.

Picard: Parsing Incrementally for Constrained Auto-Regressive Decoding from Language Models

Large pre-trained language models for textual data have an unconstrained output space; at each decoding step, they can produce any of …

Torsten Scholak, Nathan Schucher, Dzmitry Bahdanau

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2021.

DuoRAT: Towards Simpler Text-to-SQL Models

Recent neural text-to-SQL models can effectively translate natural language questions to corresponding SQL queries on unseen databases. …

Torsten Scholak, Raymond Li, Dzmitry Bahdanau, Harm de Vries, Christopher Pal

North American Chapter of the Association for Computational Linguistics (NAACL), 2021.