Keywords: knowledge editing, code large language models, program synthesis
TL;DR: A new task and synthetic dataset to benchmark knowledge editing techniques for code large language models.
Abstract: Large language models (LLMs) are increasingly being used to synthesize and reason about source code. However, the libraries and API functions they invoke are continuously evolving, with functionality being added or changed. Yet, no prior work has studied how an LLM's knowledge about code API functions can be updated. To fill this gap, we present `CodeUpdateArena`, a benchmark for knowledge editing in the code domain. An instance in our benchmark consists of a synthetic API function update paired with a program synthesis example that uses the updated functionality; our goal is to update an LLM so that it can solve this program synthesis example without being provided documentation of the update at inference time. Compared to knowledge editing for facts, success here is more challenging: a code LLM must reason about the semantics of the modified function rather than just reproduce its syntax. Our dataset is constructed by first prompting GPT-4 to generate atomic and executable function updates. Then, for each update, we generate program synthesis examples whose code solutions are likely to use the update. Our benchmark covers updates of various types to 54 functions from seven diverse Python packages, with a total of 670 program synthesis examples. Our experiments show that fine-tuning open-source code LLMs (i.e., DeepSeek, CodeLlama) on documentation of a new update does not allow them to incorporate the change for problem-solving. However, prepending the same documentation at inference time does help, establishing that the information is sufficient in context, and careful fine-tuning on examples demonstrating the update shows improvement, paving the way for better knowledge editing techniques for code.
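For concreteness, the sketch below illustrates what a single benchmark instance could contain, based only on the description in the abstract; the field names, the schema, and the example update are illustrative assumptions, not the dataset's actual format.

from dataclasses import dataclass

@dataclass
class BenchmarkInstance:
    package: str            # Python package containing the updated function
    function: str           # fully qualified name of the updated function
    update_doc: str         # documentation of the synthetic, atomic update
    update_impl: str        # executable reference implementation of the update
    problem_prompt: str     # program synthesis task that exercises the update
    unit_tests: list[str]   # tests a correct solution must pass

# A hypothetical instance: a synthetic keyword argument added to a library function.
example = BenchmarkInstance(
    package="itertools",
    function="itertools.batched",
    update_doc=("batched(iterable, n, *, pad=None) now pads the final batch "
                "with `pad` instead of returning a shorter tuple."),
    update_impl="def batched(iterable, n, *, pad=None): ...",
    problem_prompt=("Split a list of sensor readings into fixed-size windows, "
                    "padding the last window with zeros."),
    unit_tests=["assert list(batched([1, 2, 3], 2, pad=0)) == [(1, 2), (3, 0)]"],
)

Under this reading, evaluation amounts to checking whether an edited model's solution to `problem_prompt` passes `unit_tests` against `update_impl`, without `update_doc` being shown at inference time.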
Primary Area: datasets and benchmarks
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 2276