Abstract: The rapid evolution of software libraries poses a considerable hurdle for code generation, requiring continuous adaptation to frequent version updates while preserving backward compatibility. While existing code evolution benchmarks provide valuable insights, they typically lack execution-based evaluation of code generated to comply with specific library versions. To address this, we introduce GitChameleon 2.0, a novel, meticulously curated dataset comprising 328 Python code completion problems, each conditioned on specific library versions and accompanied by executable unit tests. GitChameleon 2.0 rigorously evaluates the capacity of contemporary large language models (LLMs), LLM-powered agents, code assistants, and RAG systems to perform version-conditioned code generation whose functional accuracy is verified through execution. Our extensive evaluations indicate that state-of-the-art systems struggle with this task, with enterprise models achieving baseline success rates of only 48-51\%, underscoring the intricacy of the problem. By offering an execution-based benchmark that emphasizes the dynamic nature of code libraries, GitChameleon 2.0 enables a clearer understanding of this challenge and helps guide the development of more adaptable and dependable AI code generation methods.
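To make the task described in the abstract concrete, the sketch below illustrates the general shape of a version-conditioned code completion problem paired with an executable unit test. It is a hypothetical, illustrative example (the function name and the pandas>=2.0 pin are our assumptions), not an actual GitChameleon 2.0 sample; the only library fact it relies on is that DataFrame.append was removed in pandas 2.0, so a version-compliant solution must use pd.concat.

```python
# Hypothetical example in the style of a version-conditioned completion problem;
# NOT an actual GitChameleon 2.0 item. Assumed environment: pandas>=2.0.
import pandas as pd


def add_row(df: pd.DataFrame, row: dict) -> pd.DataFrame:
    """Return a copy of `df` with `row` appended as its last row.

    Target library version: pandas>=2.0
    """
    # DataFrame.append was removed in pandas 2.0; pd.concat is the
    # version-compliant way to append a single row.
    return pd.concat([df, pd.DataFrame([row])], ignore_index=True)


# Executable unit test of the kind that accompanies each benchmark problem:
# success is determined by running the test, not by string matching.
def test_add_row():
    df = pd.DataFrame({"a": [1, 2]})
    out = add_row(df, {"a": 3})
    assert out["a"].tolist() == [1, 2, 3]


if __name__ == "__main__":
    test_add_row()
    print("ok")
```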
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: benchmarking, NLP datasets, evaluation methodologies, corpus creation
Contribution Types: Model analysis & interpretability, Data resources, Data analysis
Languages Studied: English, Python
Previous URL: https://openreview.net/forum?id=EGq6tAkXc1
Explanation Of Revisions PDF: pdf
Reassignment Request Area Chair: Yes, I want a different area chair for our submission
Reassignment Request Reviewers: Yes, I want a different set of reviewers
Justification For Not Keeping Action Editor Or Reviewers: The previous review round was based on a fundamental misunderstanding of our work's purpose, evaluating it as a "code evolution" benchmark when its stated goal is to test "Version-Conditioned Generation (VCG)". This incorrect framing led to invalid criticisms (e.g., that the benchmark is "not up-to-date") and a dismissal of our core arguments. Because the original reviewing team is anchored to this initial misinterpretation, we believe a new set of reviewers is necessary to fairly assess the heavily revised manuscript on its actual merits.
A1 Limitations Section: This paper has a limitations section.
A2 Potential Risks: No
B Use Or Create Scientific Artifacts: Yes
B1 Cite Creators Of Artifacts: Yes
B1 Elaboration: Artifacts are documented in Appendix J.
B2 Discuss The License For Artifacts: No
B2 Elaboration: We do not redistribute artifacts that we do not own.
B3 Artifact Use Consistent With Intended Use: N/A
B3 Elaboration: We used only publicly available model APIs as artifacts, and only for their intended purpose.
B4 Data Contains Personally Identifying Info Or Offensive Content: N/A
B5 Documentation Of Artifacts: Yes
B5 Elaboration: Artifacts are documented in Appendix J.
B6 Statistics For Data: Yes
B6 Elaboration: In Section 2.
C Computational Experiments: Yes
C1 Model Size And Budget: N/A
C2 Experimental Setup And Hyperparameters: Yes
C2 Elaboration: Section 3.1 (Experimental Setup)
C3 Descriptive Statistics: Yes
C3 Elaboration: Section 3.2 (Experimental Results)
C4 Parameters For Packages: N/A
D Human Subjects Including Annotators: No
D1 Instructions Given To Participants: N/A
D2 Recruitment And Payment: N/A
D3 Data Consent: N/A
D4 Ethics Review Board Approval: N/A
D5 Characteristics Of Annotators: N/A
E Ai Assistants In Research Or Writing: No
E1 Information About Use Of Ai Assistants: N/A
Author Submission Checklist: Yes
Submission Number: 861