Abstract: Driven by the surge in code generation using large language models (LLMs), numerous benchmarks have emerged to evaluate the capabilities of these LLMs. We conducted a large-scale human evaluation of **HumanEval** and **MBPP**, two popular benchmarks for Python code generation, analyzing their diversity and difficulty. Our findings reveal a critical bias toward a limited set of programming concepts, with most other concepts neglected entirely. Furthermore, we uncover a worrying prevalence of easy tasks, which can inflate estimates of model performance. To address these limitations, we propose a novel benchmark, *PythonSaga*, featuring 185 hand-crafted prompts that provide a balanced representation of 38 programming concepts across diverse difficulty levels. The code and dataset are openly available to the NLP community at https://anonymous.4open.science/r/PythonSaga.
Paper Type: long
Research Area: Resources and Evaluation
Contribution Types: Data resources, Data analysis
Languages Studied: English