Research Area: Evaluation, Inference algorithms for LMs, LMs with tools and code
Keywords: coding, LLM, memory, retrieval
TL;DR: We present a new, challenging benchmark of olympiad-level programming problems, evaluate existing and newly enabled inference methods that boost performance, and offer insights into the differing capabilities of models on USACO.
Abstract: Olympiad programming is one of the hardest reasoning challenges for humans, yet it has been understudied as a domain for benchmarking language models (LMs). In this paper, we introduce the USACO benchmark with 307 problems from USA Computing Olympiad contests, along with high-quality unit tests, reference code, and an official analysis for each problem. These resources enable us to construct and test a range of LM inference methods beyond zero-shot prompting for competitive programming. We find that state-of-the-art code generation models, such as GPT-4, achieve only an 8.7\% pass@1 accuracy with zero-shot chain-of-thought prompting, while our best inference method almost \textit{doubles} zero-shot accuracy using a novel combination of retrieval augmentation and self-reflection. However, this is still far from solving the benchmark. To better understand the remaining challenges, we perform a novel human-in-the-loop study and surprisingly find that a small number of targeted hints enable GPT-4 to solve 13 out of 15 problems previously unsolvable by any model and method. Our benchmark, baseline methods, quantitative results, and qualitative analysis thus serve as an initial step towards LMs with grounded, creative, and algorithmic reasoning.
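The abstract names a retrieval-augmented, self-reflective inference method without detailing it. Below is a minimal sketch of what such a loop could look like; it is not the authors' implementation. The `llm` and `run_tests` callables, the toy word-overlap retriever, and the round limit are all assumptions introduced here for illustration.

```python
# Hypothetical sketch of retrieval augmentation + self-reflection for code generation.
# The LM interface, the retrieval corpus, and the unit-test runner are stubs/assumptions.
from typing import Callable, List, Tuple


def retrieve_similar(problem: str, corpus: List[Tuple[str, str]], k: int = 2) -> List[Tuple[str, str]]:
    """Toy lexical retrieval: rank (problem, solution) exemplars by word overlap with the query."""
    query_words = set(problem.lower().split())
    ranked = sorted(corpus, key=lambda pair: -len(query_words & set(pair[0].lower().split())))
    return ranked[:k]


def solve_with_reflection(
    problem: str,
    corpus: List[Tuple[str, str]],
    llm: Callable[[str], str],                      # assumed interface: prompt -> completion
    run_tests: Callable[[str], Tuple[bool, str]],   # assumed interface: code -> (passed?, feedback)
    max_rounds: int = 3,
) -> str:
    """Draft a solution conditioned on retrieved exemplars, then revise it using test feedback."""
    exemplars = retrieve_similar(problem, corpus)
    context = "\n\n".join(f"Problem:\n{p}\nSolution:\n{s}" for p, s in exemplars)
    code = llm(f"{context}\n\nNow solve this problem:\n{problem}\nWrite a complete program.")

    for _ in range(max_rounds):
        passed, feedback = run_tests(code)
        if passed:
            break
        # Self-reflection step: have the model diagnose the failure, then produce a revision.
        reflection = llm(
            f"The program below failed with:\n{feedback}\nExplain the likely bug:\n{code}"
        )
        code = llm(
            f"Problem:\n{problem}\nBuggy code:\n{code}\nReflection:\n{reflection}\n"
            "Write a corrected, complete program."
        )
    return code
```

In this sketch, retrieval supplies solved exemplars as in-context guidance, and reflection feeds execution feedback back into the model; the actual prompts, retriever, and number of refinement rounds used in the paper may differ.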
Supplementary Material: zip
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the COLM Code of Ethics on https://colmweb.org/CoE.html
Author Guide: I certify that this submission complies with the submission instructions as described on https://colmweb.org/AuthorGuide.html
Submission Number: 976