Beyond Problem Solving: UOJ-Bench for Evaluating Code Generation, Hacking, and Repair in Competitive Programming

Tingqiang Xu; Hangrui Zhou; Tianle Cai; Alex Gu; Kaifeng Lyu

Published: 30 Apr 2026, Last Modified: 24 Jun 2026

Venue: ICML 2026 regular

License: CC BY 4.0

Abstract: Despite strong performance in competitive programming, the role of Large Language Models (LLMs) in supporting human learning in the same setting remains largely unexplored. In this work, we introduce **UOJ-Bench**, a benchmark designed to evaluate not only the problem-solving ability of LLMs, but also their ability to identify errors in human-written code, a crucial educational activity traditionally supported by running test cases on online judge systems. UOJ-Bench consists of three distinct tasks: code generation, code hacking, and code repair, all constructed from real-world code submissions on the Universal Online Judge (UOJ) and evaluated through UOJ's native judging infrastructure. Our results show that under one-shot evaluation, even the strongest models fail to identify errors in more than 50% of a set of submissions that have been found to be incorrect by UOJ users. While test-time scaling improves success rates to above 90%, the substantial computational costs incurred from model inference limit its practicality for large-scale deployment. Despite these limitations, we find that the best-performing models under test-time scaling can uncover errors in over 5% of full-score submissions across roughly 30 problems, suggesting that frontier LLMs can already provide complementary signals beyond standard judging systems. UOJ-Bench is publicly available at https://github.com/hehezhou/UOJ-Bench.

Lay Summary: Large Language Models (LLMs) perform well on competitive programming tasks, but it is still unclear whether they can help people learn programming by identifying mistakes in human-written code. Existing benchmarks mainly evaluate whether models can solve problems themselves, rather than whether they can support educational activities such as debugging and reviewing submissions. To address this gap, we introduce UOJ-Bench, a benchmark built from real competitive programming submissions from the Universal Online Judge (UOJ). UOJ-Bench evaluates three abilities: generating code, finding hidden bugs in incorrect submissions, and repairing faulty programs. All tasks are evaluated using UOJ’s real judging system, making the benchmark closer to practical programming environments. Our results show that even the strongest models miss errors in more than half of incorrect submissions when given only one attempt. While additional computation greatly improves performance, the cost is too high for large-scale deployment. At the same time, advanced models can already identify mistakes in some submissions that were originally accepted by the judge. These findings highlight both the potential and the current limitations of using LLMs as programming tutors and debugging assistants in competitive programming education.

Link To Code: https://github.com/hehezhou/UOJ-Bench

Primary Area: Deep Learning->Large Language Models

Keywords: Large Language Models, Competitive Programming, Automated Program Repair, Test Case Generation, Benchmark

Submission Number: 31631
