Keywords: reasoning models, language models, coding benchmark
TL;DR: A new reasoning benchmark focused on a real-world coding task
Abstract: Applying reasoning models to coding tasks is desirable, as such models can use capabilities like multi-hop reasoning to better understand code with missing context and, consequently, to better follow user instructions. However, it is difficult to directly measure a model's reasoning capabilities, and existing coding benchmarks often struggle to evaluate models on realistic tasks. This paper introduces the ``Pull Requests with Bugs'' benchmark, which isolates a multi-hop reasoning task across 60 code review questions sourced from real-world GitHub code. Evaluations show that Claude 3.7 and Claude 4 reasoning models outperform non-reasoning Llama models by up to 25\%. These results support the benchmark's ability to measure reasoning capabilities on a real-world coding task, while highlighting that even frontier reasoning models still have a gap to bridge in demonstrating generalizable multi-hop reasoning.
Submission Number: 286