CodeBPE: Investigating Subtokenization Options for Large Language Model Pretraining on Source Code

Published: 26 Mar 2022, Last Modified: 20 Oct 2024
Venue: DL4C 2022 Spotlight
Readers: Everyone
Abstract: Recent work has widely adopted large language model pretraining for source code, suggested source code-specific pretraining objectives, and investigated the applicability of various Transformer-based language model architectures for source code. This work investigates another important aspect of such models, the effect of different subtokenization options, and aims at identifying the most effective and length-efficient subtokenizations, taking into account source code specifics. We propose a subtokenization that reduces average length by 17-40% without a downstream performance drop, and show that a carefully chosen subtokenization may significantly improve quality by 0.5-2%, possibly with some length increase.
Community Implementations: [![CatalyzeX](/images/catalyzex_icon.svg) 1 code implementation](https://www.catalyzex.com/paper/codebpe-investigating-subtokenization-options/code)
