The RealHumanEval: Evaluating Large Language Models' Abilities to Support Programmers

28 May 2024 (modified: 13 Nov 2024) · Submitted to the NeurIPS 2024 Datasets and Benchmarks Track · CC BY 4.0
Keywords: code LLM, evaluation, AI-assisted programming
TL;DR: We introduce RealHumanEval, a platform for human-centric evaluation of LLMs for programming, present an extensive human study conducted with it, and release the resulting user-interaction dataset.
Abstract: Evaluation of large language models for code has primarily relied on static benchmarks such as HumanEval or, more recently, on human preferences over LLM responses. As LLMs are increasingly used as programmer assistants, we study whether gains on existing benchmarks, or more preferred LLM responses, translate into programmer productivity when coding with LLMs, including time spent coding. We introduce RealHumanEval, a web interface for measuring the ability of LLMs to assist programmers through either autocomplete or chat support. We conducted a user study (N=213) using RealHumanEval in which users interacted with six LLMs of varying base model performance. Although static benchmarks do not incorporate humans in the loop, we find that improvements in benchmark performance lead to increased programmer productivity; however, gains on benchmarks and gains in human performance are not proportional, a trend that holds across both forms of LLM support. In contrast, we find that programmer preferences do not correlate with their actual performance, motivating the need for better, human-centric proxy signals. We also open-source RealHumanEval to enable human-centric evaluation of new models, and we release the study data to facilitate efforts to improve code models.
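The central analysis contrasts per-model benchmark scores with human-centric outcome measures from the user study. Below is a minimal sketch of how such a comparison could be run on a per-model summary table; the file name, column names, and metrics are illustrative assumptions, not the actual schema of the released RealHumanEval data.

```python
# Hypothetical sketch: correlate per-model benchmark scores with measured
# programmer productivity and with programmer preference ratings.
# File name, columns, and metrics are assumptions for illustration only.
import pandas as pd
from scipy.stats import spearmanr

# Assumed columns: model, benchmark_score (e.g., pass@1 on a static benchmark),
# tasks_per_hour (a productivity proxy), preference_rating (user-reported).
df = pd.read_csv("per_model_summary.csv")

rho_bench, p_bench = spearmanr(df["benchmark_score"], df["tasks_per_hour"])
rho_pref, p_pref = spearmanr(df["preference_rating"], df["tasks_per_hour"])

print(f"benchmark vs. productivity: rho={rho_bench:.2f} (p={p_bench:.3f})")
print(f"preference vs. productivity: rho={rho_pref:.2f} (p={p_pref:.3f})")
```

With only six models, such rank correlations are coarse; the study's conclusions rest on the full interaction-level data rather than a sketch like this.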
Supplementary Material: zip
Submission Number: 1263