DataRecipe - How to Cook the Data for CodeLLM?

Published: 01 Jan 2024, Last Modified: 19 Feb 2025, Venue: ASE 2024, Readers: Everyone, License: CC BY-SA 4.0
Abstract: Despite the proliferation of language models, a lack of transparency persists regarding the datasets used to train them. Security concerns are often cited as the reason, yet identifying high-quality training data is crucial for optimal model performance. While significant effort has gone into improving model performance, dataset quality remains under-explored. Our study addresses this gap by comprehensively investigating the data-quality properties and processing strategies used to train code generation models. We focus on identifying dataset features that impact model performance and leverage these insights to optimize datasets and enhance model efficacy. Our approach involves a multifaceted analysis encompassing metadata, statistics, data quality issues, semantic correlations between intent and code, and design choices. By manipulating these features, we explore their influence on model performance. Our findings reveal that dataset design choices significantly impact the performance of code generation models. Semantic correlations between intent and code can also affect performance, although to varying degrees.