ManyTypes4TypeScript: A Comprehensive TypeScript Dataset for Sequence-Based Type Inference

Kevin Jesse, Premkumar T. Devanbu

MSR 2022 (modified: 14 Dec 2022)

Abstract: In this paper, we present ManyTypes4TypeScript, a very large corpus for training and evaluating machine-learning models for sequence-based type inference in TypeScript. The dataset includes over 9 million type annotations across 13,953 projects and 539,571 files. It is approximately 10x larger than analogous type inference datasets for Python and is the largest available for TypeScript. We also provide API access to the dataset, which can be integrated with any tokenizer and used with any state-of-the-art sequence-based model. Finally, we provide analysis and performance results for state-of-the-art code-specific models as baselines. ManyTypes4TypeScript is available on Hugging Face, Zenodo, and CodeXGLUE.
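
As an illustration of what "integrated with any tokenizer" might involve, the following is a minimal sketch of aligning per-token type labels with subword pieces, one common way to frame sequence-based type inference as token classification. It does not use the dataset's actual API: the `tokens`/`labels` layout, the label id `1` for `number`, the `toy_subword_split` helper, and the `-100` ignore value (a convention borrowed from PyTorch loss functions) are all assumptions made for this example.

```python
from typing import List, Tuple

# Assumed "no label" marker; -100 is a common ignore index in PyTorch losses.
IGNORE = -100


def toy_subword_split(token: str, size: int = 4) -> List[str]:
    """Hypothetical stand-in for a real subword tokenizer: fixed-size chunks."""
    return [token[i:i + size] for i in range(0, len(token), size)] or [token]


def align_labels(tokens: List[str], labels: List[int]) -> Tuple[List[str], List[int]]:
    """Spread word-level labels over subword pieces.

    The first piece of each token keeps the token's label; the remaining
    pieces get IGNORE so that each annotation is counted only once.
    """
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")
    pieces: List[str] = []
    piece_labels: List[int] = []
    for token, label in zip(tokens, labels):
        subwords = toy_subword_split(token)
        pieces.extend(subwords)
        piece_labels.extend([label] + [IGNORE] * (len(subwords) - 1))
    return pieces, piece_labels


# Hypothetical example: the identifier `count` carries the type `number` (id 1).
tokens = ["let", "count", ":", "number", "=", "0"]
labels = [IGNORE, 1, IGNORE, IGNORE, IGNORE, IGNORE]

pieces, piece_labels = align_labels(tokens, labels)
for piece, label in zip(pieces, piece_labels):
    print(f"{piece!r:10} {label}")
```

With a real subword tokenizer, the same first-piece rule would apply; only the splitting function changes.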
