CodeInsight: A Curated Dataset of Practical Coding Solutions from Stack Overflow

Anonymous

16 Dec 2023, ACL ARR 2023 December Blind Submission, Readers: Everyone
TL;DR: A new dataset for code generation, specialized in aiding developers.
Abstract: We introduce a novel dataset tailored for code generation, aimed at aiding developers in common tasks. Examples pair a clarified intent with an associated code snippet and, on average, three related unit tests. The dataset covers a range of libraries such as Pandas, NumPy, and Regex, along with standard Python code derived from Stack Overflow. Comprising 3,402 examples crafted by Python experts, it is designed for both model finetuning and standalone evaluation. The examples have been carefully refined to reduce data contamination, a process confirmed by the performance of three leading models: Mistral 7B, CodeLLAMA 13B, and Starcoder 15B. Beyond providing unit tests, the dataset categorizes examples to support more fine-grained analysis, enhancing the understanding of models' strengths and weaknesses in specific coding tasks. The benchmark can be accessed at [anonymized address].
Paper Type: long
Research Area: Machine Translation
Contribution Types: Data resources, Data analysis
Languages Studied: English, Python
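
To make the example format described in the abstract concrete, the following is a minimal sketch of one record (intent, code snippet, unit tests, category) and of how a generated solution might be checked against its tests. The field names, the `Example` class, the `passes_all_tests` helper, and the sample task are all hypothetical illustrations, not the dataset's actual schema or evaluation harness.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Example:
    """Hypothetical record layout; field names are assumptions, not the real schema."""
    intent: str                                       # clarified natural-language intent
    code: str                                         # reference solution
    tests: List[str] = field(default_factory=list)    # unit tests as assert statements
    category: str = "unspecified"                     # label used for fine-grained analysis


def passes_all_tests(candidate_code: str, example: Example) -> bool:
    """Run candidate code, then each unit test, in one shared namespace."""
    namespace: Dict[str, object] = {}
    try:
        exec(candidate_code, namespace)   # defines the candidate function
        for test in example.tests:
            exec(test, namespace)         # a failing assert raises AssertionError
    except Exception:
        return False
    return True


# Illustrative regex task with three unit tests (invented for this sketch).
example = Example(
    intent="Extract every integer in a string and return them as a list of ints.",
    code=(
        "import re\n"
        "def extract_ints(text):\n"
        "    return [int(m) for m in re.findall(r'-?[0-9]+', text)]\n"
    ),
    tests=[
        "assert extract_ints('a1b22c333') == [1, 22, 333]",
        "assert extract_ints('no digits') == []",
        "assert extract_ints('t=-5, u=7') == [-5, 7]",
    ],
    category="Regex",
)

# A plausible but wrong model output: it splits multi-digit numbers apart.
wrong_candidate = (
    "def extract_ints(text):\n"
    "    return [int(c) for c in text if c.isdigit()]\n"
)

reference_ok = passes_all_tests(example.code, example)
wrong_ok = passes_all_tests(wrong_candidate, example)
```

Running untrusted model output through `exec` is only acceptable inside a sandbox; a real harness would also need timeouts and process isolation, which this sketch omits.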