Keywords: SubtaskEval, Code generation, Competitive programming, Large language models, Subtask evaluation, Benchmarking, Datasets
Abstract: Existing code generation benchmarks such as HumanEval, MBPP, and LiveCodeBench evaluate only full solutions, overlooking meaningful partial progress on competitive programming tasks. We introduce **SubtaskEval**, a benchmark of 287 olympiad problems (2017–2025) that preserves official subtask structures, metadata, and online-judge links. Evaluating six recent LLMs, including a tool-augmented variant, we find that even the best model achieves only 18.47\% accuracy (pass@1), though tool use improves subtask performance. Models exhibit bottom-heavy score distributions, in contrast to the more balanced distributions of human contestants. Subtask-based evaluation thus provides a finer-grained view of model problem-solving and highlights directions for advancing LLMs in code generation.
Supplementary Material: zip
Submission Number: 54