How Do Your Code LLMs Perform? Empowering Code Instruction Tuning with Really Good Data

ACL ARR 2024 June Submission 5536 Authors

16 Jun 2024 (modified: 02 Jul 2024) · ACL ARR 2024 June Submission · CC BY 4.0
Abstract: Recently, there has been growing interest in how to construct better code instruction tuning data. However, we find that code models trained with these datasets exhibit high performance on HumanEval but perform worse on other benchmarks such as LiveCodeBench. Upon further investigation, we discover that many datasets suffer from severe data leakage. After cleaning up most of the leaked data, we find that some datasets previously considered high-quality perform poorly. This discovery reveals a new challenge: identifying which datasets genuinely qualify as high-quality code instruction data. To address this, we propose an efficient code data selection strategy that scores samples along three dimensions: instruction complexity, response quality, and instruction diversity. Based on the selected data, we present XCoder, a family of models finetuned from LLaMA3. Experiments show that XCoder achieves new state-of-the-art performance using less training data, which verifies the effectiveness of our data strategy. Moreover, we perform a comprehensive analysis of the data composition and find that existing code datasets have different characteristics depending on their construction methods, which provides new insights for future code LLMs.
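To illustrate the kind of selection strategy the abstract describes, here is a minimal Python sketch of ranking instruction-tuning samples by combined complexity and quality scores and filtering for diversity. This is not the paper's actual method: the scoring functions, weights, threshold, and the greedy procedure below are all hypothetical placeholders.

```python
# Hypothetical sketch: select code instruction-tuning samples along three
# dimensions -- instruction complexity, response quality, and instruction
# diversity -- assuming caller-supplied scorers (not the paper's real ones).
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Sample:
    instruction: str
    response: str


def select_samples(
    pool: List[Sample],
    complexity_fn: Callable[[Sample], float],                 # hypothetical scorer
    quality_fn: Callable[[Sample], float],                     # hypothetical scorer
    diversity_fn: Callable[[Sample, List[Sample]], float],     # vs. already-selected set
    budget: int,
    diversity_threshold: float = 0.5,                          # arbitrary cutoff
) -> List[Sample]:
    """Greedy selection: rank by complexity + quality, then keep a sample
    only if it is sufficiently diverse w.r.t. what was already selected."""
    ranked = sorted(pool, key=lambda s: complexity_fn(s) + quality_fn(s), reverse=True)
    selected: List[Sample] = []
    for s in ranked:
        if len(selected) >= budget:
            break
        if diversity_fn(s, selected) > diversity_threshold:
            selected.append(s)
    return selected
```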
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: code generation and understanding
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 5536