Keywords: Benchmarks, Code Reasoning, Reasoning Language Models
Abstract: Recent advances in large language models (LLMs) have demonstrated remarkable progress in mathematical and coding reasoning. However, existing code benchmarks are limited in their ability to evaluate the full spectrum of these capabilities, particularly at the level of top-tier human programming competitions. To bridge this gap, we introduce **OJBench**, a novel and challenging benchmark designed to assess the competition-level code reasoning abilities of LLMs. OJBench comprises 232 programming competition problems from NOI and ICPC, providing a rigorous test of models' reasoning skills. We conducted a comprehensive evaluation of 37 models on OJBench, spanning closed-source, open-source, reasoning-oriented, and general-purpose models. Our results indicate that even state-of-the-art reasoning models such as o4-mini and Gemini-2.5-pro-exp struggle with these highly challenging competition-level problems, highlighting the significant difficulties models face in this domain.
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 22620