Abstract: Various benchmarks have been proposed to assess the performance of large language models (LLMs) in different coding scenarios; we refer to them as code-related benchmarks. However, there are no systematic guidelines governing how such a benchmark should be developed to ensure its quality, reliability, and reproducibility.
We propose How2Bench, which comprises a 55-criterion checklist, as a set of guidelines to comprehensively govern the development of code-related benchmarks.
Using How2Bench, we profiled 274 benchmarks released within the past decade and found concerning issues.
Nearly 70\% of the benchmarks did not take measures for data quality assurance;
over 10\% were not open-sourced at all or were only partially open-sourced. Many highly cited benchmarks have loopholes, including duplicated samples; incorrect reference code, tests, or prompts; and unremoved sensitive or confidential information. Finally, we conducted a human study involving 49 participants, which revealed significant gaps in awareness of the importance of data quality, reproducibility, and transparency.
For ease of use, we provide a printable version of How2Bench in the Appendix.
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: code generation and understanding
Contribution Types: Data resources, Data analysis
Languages Studied: English
Submission Number: 5868