OJBench: A Competition Level Code Benchmark For Large Language Models

ICLR 2026 Conference Submission22620 Authors

20 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Benchmarks, Code Reasoning, Reasoning Language Models
Abstract: Recent advancements in large language models (LLMs) have demonstrated remarkable progress in mathematical and coding reasoning. However, existing code benchmarks are limited in their ability to evaluate the full spectrum of these capabilities, especially at the level of top-tier human programming competitions. To bridge this gap, we introduce **OJBench**, a novel and challenging benchmark designed to assess the competition-level code reasoning abilities of LLMs. OJBench comprises 232 programming competition problems from the National Olympiad in Informatics (NOI) and the International Collegiate Programming Contest (ICPC), providing a rigorous test of models' reasoning skills. We conducted a comprehensive evaluation of 37 models on OJBench, spanning closed-source, open-source, reasoning-oriented, and general-purpose models. Our results indicate that even state-of-the-art reasoning models such as o4-mini and Gemini-2.5-pro-exp struggle with highly challenging, competition-level problems, highlighting the significant challenges models face in this domain.
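For readers unfamiliar with how competition-style code benchmarks are typically scored, the sketch below illustrates a minimal judging loop: a candidate solution is compiled and run against hidden test cases, and a problem counts as solved only if every case passes, in the all-or-nothing style of NOI/ICPC verdicts. The function names (`judge_solution`, `pass_at_1`), the use of `g++`, and the time limit are illustrative assumptions and not the paper's actual evaluation harness.

```python
import subprocess
import tempfile
from pathlib import Path


def judge_solution(source_code: str, test_cases: list[tuple[str, str]],
                   time_limit: float = 2.0) -> bool:
    """Hypothetical judge: compile a C++ submission and run it on
    (stdin, expected_stdout) pairs. Returns True only if all cases pass."""
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "solution.cpp"
        binary = Path(tmp) / "solution"
        src.write_text(source_code)

        # A compilation error counts as a failed submission.
        compile_proc = subprocess.run(
            ["g++", "-O2", "-std=c++17", str(src), "-o", str(binary)],
            capture_output=True,
        )
        if compile_proc.returncode != 0:
            return False

        for stdin_data, expected in test_cases:
            try:
                run_proc = subprocess.run(
                    [str(binary)], input=stdin_data, capture_output=True,
                    text=True, timeout=time_limit,
                )
            except subprocess.TimeoutExpired:
                return False  # Time-limit exceeded
            # Compare token streams so trailing whitespace is ignored.
            if run_proc.returncode != 0 or run_proc.stdout.split() != expected.split():
                return False
    return True


def pass_at_1(verdicts: list[bool]) -> float:
    """Fraction of problems solved on the first attempt."""
    return sum(verdicts) / len(verdicts) if verdicts else 0.0
```

In a real harness the binary would also be sandboxed and memory-limited; this sketch only shows the pass/fail logic under the stated assumptions.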
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 22620