The RealHumanEval: Evaluating Large Language Models’ Abilities to Support Programmers

TMLR Paper 3494 Authors

14 Oct 2024 (modified: 15 Nov 2024), Under review for TMLR, CC BY 4.0
Abstract: Evaluation of large language models (LLMs) for code has primarily relied on static benchmarks such as HumanEval (Chen et al., 2021) or, more recently, on human preferences over LLM responses. As LLMs are increasingly used as programmer assistants, we study whether gains on existing benchmarks, or more preferred LLM responses, translate into programmer productivity when coding with LLMs, including time spent coding. We introduce RealHumanEval, a web interface for measuring the ability of LLMs to assist programmers through either autocomplete or chat support. We conducted a user study (N = 243) using RealHumanEval in which users interacted with seven LLMs of varying base model performance. Although static benchmarks do not incorporate humans in the loop, we find that improvements in benchmark performance lead to increased programmer productivity; however, gaps in benchmark performance are not proportional to gaps in human performance, a trend that holds across both forms of LLM support. In contrast, we find that programmers' preferences do not correlate with their actual performance, motivating the need for better proxy signals. We open-source RealHumanEval to enable human-centric evaluation of new models, and we release the study data to facilitate efforts to improve code models.
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: Changes in blue text:
- We added Table 2, which provides short task descriptions for each of the 17 tasks.
- We added Figure 5, which shows a raw plot of RealHumanEval metrics against HumanEval and MBPP performance. We also computed the Pearson correlation between RealHumanEval human performance and static benchmark scores and found it to be non-significant; the raw data in Figure 5 reveal the nature of the correlation (see the sketch below).
- We removed the numbering of the findings to reduce confusion.
- We fixed minor formatting issues.
- We adjusted the writing to describe more carefully how we operationalize our measure of productivity via the two metrics, and thereafter refer only to the metrics themselves, to avoid general claims about productivity benefits.
- We restructured the Discussion section to describe how future work can build on both the platform and the data collected from our study.
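A minimal sketch of how a Pearson correlation between per-model static benchmark scores and a RealHumanEval human-performance metric could be computed. This is not the authors' analysis code, and the arrays below are hypothetical placeholders, not study data.

```python
# Minimal sketch (assumed, not the authors' code): correlating static benchmark
# scores with a human-study metric across models.
from scipy.stats import pearsonr

# Hypothetical per-model aggregates: a static benchmark score (e.g., HumanEval
# pass@1) and a RealHumanEval metric (e.g., mean tasks completed per user).
benchmark_scores = [0.31, 0.48, 0.57, 0.65, 0.71, 0.74, 0.82]  # placeholder values
human_metric = [2.1, 2.4, 2.9, 3.0, 3.3, 3.2, 3.6]             # placeholder values

r, p_value = pearsonr(benchmark_scores, human_metric)
print(f"Pearson r = {r:.3f}, p = {p_value:.3f}")
```

With only seven models, such a correlation can easily be non-significant even when a trend is visible in the raw scatter plot, which is why inspecting Figure 5 directly is informative.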
Assigned Action Editor: ~Sanghyun_Hong1
Submission Number: 3494