Keywords: Benchmarks, Code Reasoning, Reasoning Language Models
Abstract: Recent advances in large language models (LLMs) have demonstrated remarkable progress in mathematical and coding reasoning. However, existing code benchmarks are limited in their ability to evaluate the full spectrum of these capabilities, particularly at the level of top-tier human programming competitions. To bridge this gap, we introduce **OJBench**, a novel and challenging benchmark designed to assess the competition-level code reasoning abilities of LLMs. OJBench comprises 232 programming competition problems from NOI and ICPC, providing a rigorous test of models' reasoning skills. We conducted a comprehensive evaluation of 37 models on OJBench, spanning closed-source, open-source, reasoning-oriented, and general-purpose models. Our results indicate that even state-of-the-art reasoning models such as o4-mini and Gemini-2.5-pro-exp struggle with these highly challenging competition-level problems, highlighting the significant difficulties models face in this domain.
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 22620