Evaluating Tokenizer Adaptation Methods for Large Language Models on Low-Resource Programming Languages
Keywords: Large Language Models, Code Generation, Programming Languages, Low-resource Programming Languages
TL;DR: Evaluation of Code LLMs adapted to low-resource programming languages using tokenizer adaptation methods.
Abstract: Large language models are predominantly trained on high-resource programming languages and perform sub-optimally on low-resource programming languages (LRPLs). This study investigates the impact of tokenizer adaptation methods on improving code generation for LRPLs. We evaluate the popular StarCoder 2 and DeepSeek-Coder models adapted to Elixir and Racket using methods such as Fast Vocabulary Transfer (FVT), FOCUS, and Zero-shot Tokenizer Transfer (ZeTT). Our experiments reveal that ZeTT outperforms the other methods, achieving significant improvements in handling syntax, program logic, and data types for LRPLs. However, we also highlight performance declines in non-target languages such as Python after tokenizer adaptation. The study confirms the positive impact of tokenizer adaptation on LRPL code generation and suggests directions for future research, including improving token embeddings.
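To illustrate the general idea behind the tokenizer adaptation methods named above, the sketch below shows a simplified FVT-style embedding initialization: tokens shared between the old and new vocabularies keep their existing embeddings, while new tokens are initialized as the mean of the old-tokenizer subtoken embeddings of their surface string. This is not the paper's implementation; the base checkpoint and the adapted-tokenizer path are placeholders, and real FVT/FOCUS/ZeTT pipelines involve additional details (e.g., output embeddings, special tokens, or a trained hypernetwork in ZeTT).

```python
# Simplified FVT-style embedding initialization (illustrative sketch only).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

old_tok = AutoTokenizer.from_pretrained("bigcode/starcoder2-3b")
new_tok = AutoTokenizer.from_pretrained("path/to/elixir-adapted-tokenizer")  # hypothetical path
model = AutoModelForCausalLM.from_pretrained("bigcode/starcoder2-3b")

old_emb = model.get_input_embeddings().weight.detach()
old_vocab = old_tok.get_vocab()

new_emb = torch.empty(len(new_tok), old_emb.shape[1])
for token, new_id in new_tok.get_vocab().items():
    if token in old_vocab:
        # Shared token: copy the original embedding.
        new_emb[new_id] = old_emb[old_vocab[token]]
    else:
        # New token: average the old-tokenizer subtoken embeddings of its surface form.
        surface = new_tok.convert_tokens_to_string([token])
        old_ids = old_tok.encode(surface, add_special_tokens=False)
        new_emb[new_id] = old_emb[old_ids].mean(dim=0) if old_ids else old_emb.mean(dim=0)

# Swap in the adapted vocabulary and its initialized embeddings.
model.resize_token_embeddings(len(new_tok))
model.get_input_embeddings().weight.data.copy_(new_emb)
```

After this initialization, the adapted model is typically further trained on target-language (e.g., Elixir or Racket) code so the new embeddings align with the rest of the network.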
Archival Status: Archival
Paper Length: Short Paper (up to 4 pages of content)
Submission Number: 196