The Stack: 3 TB of permissively licensed source code

Denis Kocetkov; Raymond Li; Loubna Ben allal; Jia LI; Chenghao Mou; Yacine Jernite; Margaret Mitchell; Carlos Muñoz Ferrandis; Sean Hughes; Thomas Wolf; Dzmitry Bahdanau; Leandro Von Werra; Harm de Vries

Published: 07 Jun 2023, Last Modified: 17 Sept 2024. Accepted by TMLR.

Abstract: Large Language Models (LLMs) play an ever-increasing role in the field of Artificial Intelligence (AI)--not only for natural language processing but also for code understanding and generation. To stimulate open and responsible research on LLMs for code, we introduce The Stack, a 3.1 TB dataset consisting of permissively licensed source code in 30 programming languages. We describe how we collect the full dataset, construct a permissively licensed subset, present a data governance plan, discuss limitations, and show promising results on text2code benchmarks by training 350M-parameter decoders on different Python subsets. We find that (1) near-deduplicating the data significantly boosts performance across all experiments, and (2) it is possible to match previously reported HumanEval and MBPP performance using only permissively licensed data. We make the dataset available at https://hf.co/BigCode, provide a tool called "Am I in The Stack" for developers to search The Stack for copies of their code (https://hf.co/spaces/bigcode/in-the-stack), and provide a process for code to be removed from the dataset.
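The near-deduplication step mentioned in the abstract can be illustrated with a small sketch. The abstract does not say how near-duplicates are detected, so everything here is an assumption for illustration: the function names, the token 5-gram shingles, the 0.85 similarity threshold, and the use of exact pairwise Jaccard similarity (which is quadratic in the number of files; at multi-terabyte scale an approximate scheme such as MinHash with locality-sensitive hashing would normally replace it).

```python
import re


def shingles(source: str, k: int = 5) -> set:
    """Return the set of k-token shingles of a source file (hypothetical helper)."""
    # Split into identifiers/numbers and single punctuation characters,
    # so that whitespace-only differences do not affect the result.
    tokens = re.findall(r"\w+|[^\w\s]", source)
    if len(tokens) < k:
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |a & b| / |a | b|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_deduplicate(files: list, threshold: float = 0.85) -> list:
    """Keep the first file of every group of near-duplicates.

    Assumption: a file is a near-duplicate if its Jaccard similarity to any
    already-kept file is at least `threshold`. This is an exact O(n^2) pass,
    meant only to show the idea on small inputs.
    """
    kept = []
    kept_shingles = []
    for text in files:
        s = shingles(text)
        if any(jaccard(s, other) >= threshold for other in kept_shingles):
            continue  # too similar to a file we already kept
        kept.append(text)
        kept_shingles.append(s)
    return kept
```

For example, two copies of the same function that differ only in indentation produce identical shingle sets, so `near_deduplicate` keeps only the first copy, while an unrelated file is retained.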

Submission Length: Long submission (more than 12 pages of main content)

Changes Since Last Submission: Camera-ready version

Code: https://github.com/bigcode-project

Supplementary Material: pdf

Assigned Action Editor: ~Swarat_Chaudhuri1

License: Creative Commons Attribution 4.0 International (CC BY 4.0)

Submission Number: 643
