Abstract: The continuous emergence of large language models that are especially capable of handling programming languages makes it crucial to develop better benchmarks that assess their skills.
In this paper we introduce CodeMod, the first benchmark dataset for code modification.
We evaluate this dataset in both zero-shot and fine-tuned configurations using the most recent large language models (LLMs) for code.
We also demonstrate its usefulness by evaluating the code synthesis performance of fine-tuned models, showing up to 5 points of improvement in pass@1 on the HumanEval benchmark.
This dataset is a valuable addition to the code benchmark landscape.
Paper Type: short
Research Area: Resources and Evaluation
Contribution Types: Data resources
Languages Studied: English, Python, Java, C++
Consent To Share Submission Details: On behalf of all authors, we agree to the terms above to share our submission details.