StackEval: Benchmarking LLMs in Coding Assistance

Published: 26 Sept 2024, Last Modified: 13 Nov 2024
Venue: NeurIPS 2024 Track Datasets and Benchmarks Poster
License: CC BY 4.0
Keywords: Large Language Models, Benchmark, Transparency in AI, Real-World Use Cases, Evaluation Metrics
TL;DR: We introduce two coding benchmarks - StackEval and StackUnseen - to evaluate language models' performance on real programming tasks, along with a comprehensive framework to assess how well LLMs can judge coding solutions.
Abstract: We present two comprehensive benchmarks to evaluate the performance of language models in coding assistance tasks, covering code writing, debugging, code review, and conceptual understanding. Our main contributions are two curated datasets: StackEval, a large-scale benchmark derived from Stack Overflow questions, and StackUnseen, a dynamic benchmark featuring the most recent Stack Overflow content. These benchmarks offer novel insights into the capabilities and limitations of LLMs, particularly in handling new and emerging content. Additionally, we assess LLMs' proficiency as judges for coding tasks using a curated, human-annotated dataset, exploring their evaluation capabilities and potential biases, including whether they favor their own generated solutions. Our findings underscore the potential of these benchmarks to advance LLM development and application in coding assistance. To ensure reproducibility, we publicly share our datasets and evaluation code at https://github.com/ProsusAI/stack-eval.
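To make the LLM-as-judge setup described above concrete, the sketch below shows a minimal acceptance-rate evaluation loop. It is an illustrative assumption, not the repository's actual evaluation code: the model name, judge prompt, JSONL field names ("question", "accepted_answer", "candidate_answer"), and file name are all hypothetical; only the OpenAI chat-completions API call is real.

```python
# Hypothetical sketch of an LLM-as-judge evaluation loop (not the repo's actual code).
# Assumes an OpenAI-compatible client and a JSONL file with fields
# "question", "accepted_answer", and "candidate_answer" -- names are illustrative.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = (
    "You are grading a coding-assistance answer.\n"
    "Question:\n{question}\n\n"
    "Reference (accepted) answer:\n{reference}\n\n"
    "Candidate answer:\n{candidate}\n\n"
    "Reply with a single word: ACCEPTABLE or UNACCEPTABLE."
)

def judge(question: str, reference: str, candidate: str, model: str = "gpt-4o") -> bool:
    """Return True if the judge model deems the candidate answer acceptable."""
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                question=question, reference=reference, candidate=candidate
            ),
        }],
        temperature=0,
    )
    verdict = response.choices[0].message.content.strip().upper()
    return verdict.startswith("ACCEPTABLE")

# Aggregate an acceptance rate over a small sample file (illustrative file name).
with open("stackeval_sample.jsonl") as f:
    rows = [json.loads(line) for line in f]
accepted = sum(
    judge(r["question"], r["accepted_answer"], r["candidate_answer"]) for r in rows
)
print(f"Acceptance rate: {accepted / len(rows):.2%}")
```

A binary accept/reject verdict keeps the judging signal simple; the paper's released evaluation code may use a different rubric or scoring scale.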
Supplementary Material: zip
Submission Number: 1879