Unmasking the Version-Switching Capabilities of Code Generation Models

Nizar Islah; Justine Gehring; Diganta Misra; Eilif Benjamin Muller; Irina Rish; Terry Yue Zhuo; Massimo Caccia

Unmasking the Version-Switching Capabilities of Code Generation Models

Nizar Islah, Justine Gehring, Diganta Misra, Eilif Benjamin Muller, Irina Rish, Terry Yue Zhuo, Massimo Caccia

26 Sept 2024 (modified: 20 Nov 2024)ICLR 2025 Conference Withdrawn SubmissionEveryoneRevisionsBibTeXCC BY 4.0

Keywords: Code generation, LLM, code LLM, benchmark, code versioning

TL;DR: LLMs excel at coding but struggle with frequent library updates. GitChameleon, with 116 version-sensitive Python problems, shows models like GPT-4o (pass@10 of 41.6) struggle with version-specific code, highlighting the need for better benchmarks.

Abstract: The rapid evolution of software libraries presents a significant challenge for code generation models, which must adapt to frequent version updates while maintaining compatibility with previous versions. Existing code completion benchmarks often overlook this dynamic aspect, and the one that does consider it relies on static code prediction tasks without execution-based evaluation, offering a limited perspective on a model's practical usability. To address this gap, we introduce GitChameleon, a novel, manually curated dataset comprising 116 Python code completion problems, each conditioned on specific library versions and accompanied by executable unit tests. GitChameleon is designed to rigorously assess the ability of modern large language models (LLMs) to generate version-specific code that is not only syntactically correct but also functionally accurate upon execution. Our comprehensive evaluations reveal that state-of-the-art LLMs struggle with this task; for instance, \textbf{GPT-4} achieves a pass@10 of only 39.9\% (43.7\% when provided with error feedback), highlighting the complexity of the problem and the limitations of current models. By providing an execution-based benchmark that emphasizes the dynamic nature of code libraries, GitChameleon serves as a critical tool for advancing the development of more adaptable and reliable code generation models. We release the dataset and evaluation framework to encourage further research in this vital area.

Primary Area: applications to computer vision, audio, language, and other modalities

Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.

Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.

Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.

Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.

No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.

Submission Number: 8031

Loading