CodeUpdateArena: Benchmarking Knowledge Editing on API Updates

Anonymous Authors (ACL ARR 2025 February Submission 1532)

13 Feb 2025 (modified: 09 May 2025). License: CC BY 4.0
Abstract: Large language models (LLMs) are increasingly used to synthesize and reason about source code. However, the libraries and API functions they invoke are continuously evolving, with functionality being added or changed. Yet no prior work has studied how an LLM's knowledge of code API functions can be updated. To fill this gap, we present `CodeUpdateArena`, a benchmark for knowledge editing in the code domain. An instance in our benchmark consists of a synthetic API function update paired with a program synthesis example that uses the updated functionality; the goal is to update an LLM so that it can solve this program synthesis example without being given documentation of the update at inference time. Compared to knowledge editing of facts, success here is more challenging: a code LLM must reason about the semantics of the modified function rather than merely reproduce its syntax. We construct our dataset by first prompting GPT-4 to generate atomic, executable function updates; then, for each update, we generate program synthesis examples whose code solutions are likely to use the update. Our benchmark covers updates of various types to 54 functions from seven diverse Python packages, with a total of 670 program synthesis examples. Our experiments show that fine-tuning open-source code LLMs (DeepSeek, CodeLlama) on documentation of a new update does not enable them to incorporate the change when solving problems. However, prepending the same documentation at inference time does help, establishing that the information is available to the models, and careful fine-tuning on examples demonstrating the update yields improvement, paving the way for better knowledge editing techniques for code.
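To make the setup concrete, below is a minimal sketch of how a benchmark instance (an API update plus a program synthesis example) and the two evaluation conditions described in the abstract might be represented. The schema, field names, example values, and the `build_prompt` helper are illustrative assumptions, not the benchmark's released format.

```python
# Hypothetical sketch of a CodeUpdateArena-style instance and the
# "prepend documentation at inference time" baseline. All names here
# are assumptions for illustration, not the benchmark's actual schema.
from dataclasses import dataclass
from typing import List


@dataclass
class APIUpdate:
    package: str        # one of the seven covered Python packages
    function: str       # fully qualified name of the updated function
    documentation: str  # docstring describing the updated behavior


@dataclass
class SynthesisExample:
    problem: str           # task whose natural solution uses the update
    unit_tests: List[str]  # executable checks of solution semantics


def build_prompt(update: APIUpdate, example: SynthesisExample,
                 prepend_doc: bool) -> str:
    """Build the model input for one instance.

    With prepend_doc=True, the update's documentation is shown
    in-context (the condition the abstract reports as helping).
    With prepend_doc=False, the model must rely on knowledge
    injected earlier, e.g. by fine-tuning on the documentation.
    """
    doc = (f"Updated API documentation:\n{update.documentation}\n\n"
           if prepend_doc else "")
    return f"{doc}Task: {example.problem}\nWrite a Python solution."


# Illustrative usage with made-up content:
update = APIUpdate(
    package="itertools",
    function="itertools.batched",
    documentation="batched(iterable, n, *, strict=False) -> batches...",
)
example = SynthesisExample(
    problem="Split a list of records into fixed-size batches.",
    unit_tests=["assert len(solve(list(range(10)), 3)) == 4"],
)
print(build_prompt(update, example, prepend_doc=True))
```

Under this sketch, evaluation would amount to generating a solution for each prompt and running the instance's unit tests; comparing pass rates across the two `prepend_doc` conditions mirrors the fine-tuning-versus-in-context comparison the abstract describes.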
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: knowledge editing, code large language models, program synthesis
Contribution Types: NLP engineering experiment, Data resources
Languages Studied: English, Python
Submission Number: 1532