How Should We Build A Benchmark? Revisiting 274 Code-Related Benchmarks For LLMs

ACL ARR 2025 February Submission5868 Authors

16 Feb 2025 (modified: 23 May 2025) · License: CC BY 4.0
Abstract: Various benchmarks have been proposed to assess the performance of large language models (LLMs) in different coding scenarios; we refer to them as code-related benchmarks. However, there are no systematic guidelines for developing such benchmarks to ensure their quality, reliability, and reproducibility. We propose How2Bench, a 55-criteria checklist that serves as a set of guidelines to comprehensively govern the development of code-related benchmarks. Using How2Bench, we profiled 274 benchmarks released within the past decade and found concerning issues. Nearly 70\% of the benchmarks took no measures to assure data quality, and over 10\% were not open-sourced or were only partially open-sourced. Many highly cited benchmarks have loopholes, including duplicated samples; incorrect reference code, tests, or prompts; and unremoved sensitive or confidential information. Finally, we conducted a human study involving 49 participants, which revealed significant gaps in awareness of the importance of data quality, reproducibility, and transparency. For ease of use, we provide a printable version of How2Bench in the Appendix.
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: code generation and understanding
Contribution Types: Data resources, Data analysis
Languages Studied: English
Submission Number: 5868