Keywords: SubtaskEval, Code generation, Competitive programming, Large language models, Subtask evaluation, Benchmarking, Datasets
Abstract: Existing code generation benchmarks such as HumanEval, MBPP, and LiveCodeBench evaluate only full solutions, overlooking meaningful partial progress on competitive programming tasks. We introduce **SubtaskEval**, a benchmark of 287 olympiad problems (2017–2025) that preserves official subtask structures, metadata, and online-judge links. Evaluating six recent LLMs, including a tool-augmented variant, we find that even the best model achieves only 18.47\% accuracy (pass@1), though tool use improves subtask performance. Models exhibit bottom-heavy score distributions, in contrast to the more balanced distributions of human contestants. Subtask-based evaluation thus provides a finer-grained view of model problem-solving and highlights directions for advancing LLMs in code generation.
Supplementary Material: zip
Submission Number: 54