Abstract: The rapid evolution of software libraries poses a considerable hurdle for code generation, requiring continuous adaptation to frequent version updates while preserving backward compatibility. While existing code evolution benchmarks provide valuable insights, they typically lack execution-based evaluation of code generated to comply with specific library versions. To address this, we introduce GitChameleon 2.0, a novel, meticulously curated dataset comprising 328 Python code completion problems, each conditioned on specific library versions and accompanied by executable unit tests. GitChameleon 2.0 rigorously evaluates the capacity of contemporary large language models (LLMs), LLM-powered agents, code assistants, and RAG systems to perform version-conditioned code generation whose functional accuracy is verified through execution. Our extensive evaluations indicate that state-of-the-art systems struggle with this task, with enterprise models achieving baseline success rates of only 48-51\%, underscoring the intricacy of the problem. By offering an execution-based benchmark that emphasizes the dynamic nature of code libraries, GitChameleon 2.0 enables a clearer understanding of this challenge and helps guide the development of more adaptable and dependable AI code generation methods.
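To make the task described in the abstract concrete, the sketch below illustrates the general shape of a version-conditioned code completion problem paired with an executable unit test. It is a hypothetical, illustrative example (the function name and the pandas>=2.0 pin are our assumptions), not an actual GitChameleon 2.0 sample; the only library fact it relies on is that DataFrame.append was removed in pandas 2.0, so a version-compliant solution must use pd.concat.

```python
# Hypothetical example in the style of a version-conditioned completion problem;
# NOT an actual GitChameleon 2.0 item. Assumed environment: pandas>=2.0.
import pandas as pd


def add_row(df: pd.DataFrame, row: dict) -> pd.DataFrame:
    """Return a copy of `df` with `row` appended as its last row.

    Target library version: pandas>=2.0
    """
    # DataFrame.append was removed in pandas 2.0; pd.concat is the
    # version-compliant way to append a single row.
    return pd.concat([df, pd.DataFrame([row])], ignore_index=True)


# Executable unit test of the kind that accompanies each benchmark problem:
# success is determined by running the test, not by string matching.
def test_add_row():
    df = pd.DataFrame({"a": [1, 2]})
    out = add_row(df, {"a": 3})
    assert out["a"].tolist() == [1, 2, 3]


if __name__ == "__main__":
    test_add_row()
    print("ok")
```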
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: benchmarking, NLP datasets, evaluation methodologies, corpus creation
Contribution Types: Model analysis & interpretability, Data resources, Data analysis
Languages Studied: English, Python
Previous URL: https://openreview.net/forum?id=EGq6tAkXc1
Explanation Of Revisions PDF: pdf
Reassignment Request Area Chair: Yes, I want a different area chair for our submission
Reassignment Request Reviewers: Yes, I want a different set of reviewers
Justification For Not Keeping Action Editor Or Reviewers: The previous review round was based on a fundamental misunderstanding of our work's purpose, evaluating it as a "code evolution" benchmark when its stated goal is to test "Version-Conditioned Generation (VCG)". This incorrect framing led to invalid criticisms (e.g., that the benchmark is "not up-to-date") and a dismissal of our core arguments. Because the original reviewing team is anchored to this initial misinterpretation, we believe a new set of reviewers is necessary to fairly assess the heavily revised manuscript on its actual merits.
A1 Limitations Section: This paper has a limitations section.
A2 Potential Risks: No
B Use Or Create Scientific Artifacts: Yes
B1 Cite Creators Of Artifacts: Yes
B1 Elaboration: Artifacts are documented in Appendix J.
B2 Discuss The License For Artifacts: No
B2 Elaboration: We do not redistribute artifacts that we do not own.
B3 Artifact Use Consistent With Intended Use: N/A
B3 Elaboration: We used only publicly available model APIs as artifacts, and only for their intended purpose.
B4 Data Contains Personally Identifying Info Or Offensive Content: N/A
B5 Documentation Of Artifacts: Yes
B5 Elaboration: Artifacts are documented in Appendix J.
B6 Statistics For Data: Yes
B6 Elaboration: In Section 2.
C Computational Experiments: Yes
C1 Model Size And Budget: N/A
C2 Experimental Setup And Hyperparameters: Yes
C2 Elaboration: Section 3.1 (Experimental Setup)
C3 Descriptive Statistics: Yes
C3 Elaboration: Section 3.2 (Experimental Results)
C4 Parameters For Packages: N/A
D Human Subjects Including Annotators: No
D1 Instructions Given To Participants: N/A
D2 Recruitment And Payment: N/A
D3 Data Consent: N/A
D4 Ethics Review Board Approval: N/A
D5 Characteristics Of Annotators: N/A
E Ai Assistants In Research Or Writing: No
E1 Information About Use Of Ai Assistants: N/A
Author Submission Checklist: Yes
Submission Number: 861