Keywords: reasoning models, language models, coding benchmark
TL;DR: A new reasoning benchmark focused on a real-world coding task
Abstract: Applying reasoning models to coding tasks is desirable, as such models can use capabilities like multi-hop reasoning to better understand code with missing context and, consequently, to better follow user instructions. However, it is difficult to directly measure a model's reasoning capabilities, and existing coding benchmarks often struggle to evaluate models on realistic tasks. This paper introduces the ``Pull Requests with Bugs'' benchmark, which isolates a multi-hop reasoning task across 60 code review questions sourced from real-world GitHub code. Evaluations show that Claude 3.7 and Claude 4 reasoning models outperform non-reasoning Llama models by up to 25\%. These results support the benchmark's ability to measure reasoning capabilities on a real-world coding task, while highlighting that even frontier reasoning models still have a gap to bridge in demonstrating generalizable multi-hop reasoning.
Submission Number: 286