Abstract: State-of-the-art language models have demonstrated impressive code generation capabilities but struggle with real-world software engineering tasks such as code review, hindering their practical use. Review comments are often implicit, ambiguous, and colloquial, requiring models to grasp both code and human intent.
This challenge calls for evaluating language models' ability to bridge technical and conversational contexts.
While existing work has employed the automated code refinement task to resolve these comments, current evaluation methods fall short, relying on metrics that provide limited insight into model failures and remaining susceptible to training data contamination.
To address these limitations, we introduce CodeReviewQA, a novel evaluation benchmark that enables fine-grained assessment of model capabilities and mitigates data contamination risks.
In CodeReviewQA, we decompose the generation task of code refinement into three essential reasoning steps: change type recognition, change localisation, and solution identification. Each step is reformulated as multiple-choice questions of varied difficulty, enabling precise assessment of model capabilities while mitigating data contamination risks. Our comprehensive evaluation spans 65 recently released large language models on 900 manually curated, high-quality examples across nine programming languages. Our results show that CodeReviewQA reveals model capability gaps across the different reasoning tasks and exposes model weaknesses.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: Generation, Interpretability and Analysis of Models for NLP, NLP Applications, Question Answering, Resources and Evaluation
Contribution Types: Model analysis & interpretability, Data resources
Languages Studied: English, Programming Languages
Submission Number: 8228