DevEval: A Code Generation Benchmark for Practical Software Projects

Anonymous

16 Feb 2024 | ACL ARR 2024 February Blind Submission | Readers: Everyone
Abstract: How to evaluate Large Language Models (LLMs) on code generation is an open question. No existing benchmark evaluates LLMs on practical software projects. In this paper, we propose a new benchmark named DevEval, aligned with developers' experiences in practical projects. DevEval is collected through a rigorous pipeline and contains 2,690 samples from 119 practical projects, covering 10 domains. Compared to previous benchmarks, DevEval aligns with practical projects in multiple dimensions, e.g., real program distributions, sufficient dependencies, and large-scale project contexts. We assess 12 popular LLMs on DevEval (e.g., gpt-4, gpt-3.5-turbo, Claude 2, GLM-4, CodeLLaMa, StarCoder, and Mistral) and reveal their actual abilities in code generation. For instance, the highest Pass@1 of gpt-3.5-turbo is only 42.97% in our experiments. We also discuss the challenges of code generation in practical projects. We open-source DevEval and hope it can facilitate the development of code generation in practical projects.
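For context on the reported metric: assuming the abstract's Pass@1 follows the standard unbiased pass@k estimator of Chen et al. (2021), a minimal Python sketch of that computation is below. The function name and the example numbers are illustrative only, not taken from the DevEval release.

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: total samples generated for a problem
    c: number of samples that pass all tests
    k: number of samples drawn
    """
    # If fewer than k samples fail, every size-k draw contains a pass.
    if n - c < k:
        return 1.0
    # Probability that a random size-k subset contains at least one passing sample.
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical example: 10 samples per problem, 4 pass -> pass@1 = 0.4
print(pass_at_k(n=10, c=4, k=1))

The benchmark-level Pass@1 is then the mean of this per-problem estimate over all samples; under this definition, 42.97% means gpt-3.5-turbo's first attempt passes the tests on roughly 43% of DevEval problems.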
Paper Type: long
Research Area: Resources and Evaluation
Contribution Types: Data resources
Languages Studied: Programming Languages