Abstract: We adapt the popular MBPP dataset for code generation from natural language to emphasize the natural-language aspect by evaluating generated code against multiple sets of assertions. We also update the task descriptions to remove ambiguity and instructions that the assertions do not evaluate, such as which specific algorithm to use. The adapted dataset addresses three problems with the original: reliance on the provided test cases to infer the correct function signature, contamination from the exact phrasing appearing in training data, and misalignment between the instruction and what the assertions actually test. We report results for popular open- and closed-weight models on both the original and adapted datasets.
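A minimal sketch of the multi-assertion evaluation the abstract describes, assuming each generated solution is executed and then checked against several independently written assertion sets; the function and variable names here (run_candidate, assertion_sets) are illustrative, not taken from the paper's released code:

```python
# Illustrative sketch: score one generated solution against multiple assertion sets.
# All names are hypothetical; this is not the paper's actual evaluation harness.

def run_candidate(candidate_code: str, assertion_sets: list[list[str]]) -> float:
    """Return the fraction of assertion sets the candidate passes in full."""
    passed = 0
    for assertions in assertion_sets:
        namespace: dict = {}
        try:
            exec(candidate_code, namespace)    # define the generated function
            for assertion in assertions:
                exec(assertion, namespace)     # e.g. "assert add(1, 2) == 3"
            passed += 1                        # every assertion in this set held
        except Exception:
            pass                               # any failure fails the whole set
    return passed / len(assertion_sets)


if __name__ == "__main__":
    candidate = "def add(a, b):\n    return a + b\n"
    sets = [
        ["assert add(1, 2) == 3", "assert add(0, 0) == 0"],
        ["assert add(-1, 1) == 0"],
    ]
    print(run_candidate(candidate, sets))  # 1.0 if the candidate passes both sets
```

Scoring against several assertion sets rather than one rewards solutions that follow the natural-language description in general, rather than ones overfit to a single visible test case.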
Paper Type: Short
Research Area: Resources and Evaluation
Research Area Keywords: Code benchmarks, Code generation, Code evaluation
Contribution Types: Model analysis & interpretability, Publicly available software and/or pre-trained models, Data resources, Data analysis
Languages Studied: English
Submission Number: 2989