Abstract: We adapt the popular MBPP dataset for code generation from natural language to emphasize the natural-language aspect by evaluating generated code against multiple sets of assertions. We also update the task descriptions to remove ambiguity and instructions that the assertions do not evaluate, such as which specific algorithm to use. The adapted dataset addresses three problems with the original: reliance on the provided test cases to infer the correct function signature, contamination from the exact phrasing appearing in training data, and misalignment between the instruction and what the assertions actually test. We report results for popular open- and closed-weight models on both the original and adapted datasets.
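A minimal sketch of the multi-assertion evaluation the abstract describes, assuming each generated solution is executed and then checked against several independently written assertion sets; the function and variable names here (run_candidate, assertion_sets) are illustrative, not taken from the paper's released code:

```python
# Illustrative sketch: score one generated solution against multiple assertion sets.
# All names are hypothetical; this is not the paper's actual evaluation harness.

def run_candidate(candidate_code: str, assertion_sets: list[list[str]]) -> float:
    """Return the fraction of assertion sets the candidate passes in full."""
    passed = 0
    for assertions in assertion_sets:
        namespace: dict = {}
        try:
            exec(candidate_code, namespace)    # define the generated function
            for assertion in assertions:
                exec(assertion, namespace)     # e.g. "assert add(1, 2) == 3"
            passed += 1                        # every assertion in this set held
        except Exception:
            pass                               # any failure fails the whole set
    return passed / len(assertion_sets)


if __name__ == "__main__":
    candidate = "def add(a, b):\n    return a + b\n"
    sets = [
        ["assert add(1, 2) == 3", "assert add(0, 0) == 0"],
        ["assert add(-1, 1) == 0"],
    ]
    print(run_candidate(candidate, sets))  # 1.0 if the candidate passes both sets
```

Scoring against several assertion sets rather than one rewards solutions that follow the natural-language description in general, rather than ones overfit to a single visible test case.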
Paper Type: Short
Research Area: Resources and Evaluation
Research Area Keywords: Code benchmarks, Code generation, Code evaluation
Contribution Types: Model analysis & interpretability, Publicly available software and/or pre-trained models, Data resources, Data analysis
Languages Studied: English
Submission Number: 2989