【Proposal】RLOJF: Enhancing LLMs in Olympiad Programming with Online Judge Feedback

20 Oct 2024 (modified: 05 Nov 2024) · THU 2024 Fall AML Submission · CC BY-NC-SA 4.0
Keywords: Large Language Models, Olympiad Programming, Reinforcement Learning, Online Judge, Code Generation
Abstract: Large Language Models (LLMs) have achieved significant success in programming tasks, particularly excelling on interview-oriented platforms like LeetCode. However, we observe that these models still underperform on more complex problems in Olympiad level competitions. This performance gap primarily stems from the deep mathematical reasoning, complex algorithmic thinking, and diverse solution strategies required for Olympiad programming problems. To address this issue, we propose a novel approach: Reinforcement Learning with Online Judge Feedback (RLOJF). This method simulates the iterative process in real programming environments, allowing the model to dynamically adjust its output based on scores and error messages provided by Online Judge (OJ) systems. RLOJF aims to improve the correctness and efficiency of code generated by the model, develop its ability to iteratively refine code using automated feedback, and enhance its reasoning and problem-solving capabilities in complex programming tasks. Our research contributions include: proposing a new reinforcement learning framework for complex programming tasks, designing a training methodology utilizing OJ feedback, conducting extensive experiments on a large number of complex programming problems to validate the method's effectiveness.
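The feedback loop the abstract describes — generate code, submit it to an OJ, and feed the score and error messages back into the next attempt — can be sketched as follows. This is a minimal illustration, not the authors' implementation: `judge` is a stub OJ that runs a candidate Python program against (stdin, expected-output) pairs, and `generate` stands in for an LLM policy; the function names, the per-test scoring, and the fixed round budget are all assumptions made for the sketch.

```python
import subprocess
import sys

def judge(code: str, tests):
    """Stub Online Judge: run candidate code on (stdin, expected) pairs.
    Returns (score in [0, 1], first error/feedback message or "")."""
    passed, feedback = 0, ""
    for stdin, expected in tests:
        try:
            out = subprocess.run(
                [sys.executable, "-c", code],
                input=stdin, capture_output=True, text=True, timeout=2,
            )
            if out.returncode != 0:
                # Runtime error: surface the last stderr line as feedback.
                feedback = feedback or out.stderr.strip().splitlines()[-1]
            elif out.stdout.strip() == expected:
                passed += 1
            else:
                feedback = feedback or (
                    f"wrong answer: got {out.stdout.strip()!r}, "
                    f"expected {expected!r}"
                )
        except subprocess.TimeoutExpired:
            feedback = feedback or "time limit exceeded"
    return passed / len(tests), feedback

def rlojf_episode(generate, problem, tests, max_rounds=3):
    """One RLOJF-style episode: generate, judge, feed score + errors back.
    `generate(problem, feedback)` is a placeholder for the LLM policy.
    Returns the (code, reward) trajectory usable for a policy update."""
    feedback, trajectory = "", []
    for _ in range(max_rounds):
        code = generate(problem, feedback)
        reward, feedback = judge(code, tests)
        trajectory.append((code, reward))
        if reward == 1.0:  # all OJ tests passed; stop refining
            break
    return trajectory
```

In an actual RL setup, the per-round rewards in `trajectory` would drive the policy-gradient update, and the OJ's error messages would be appended to the model's context for the next generation; the stub above only captures the environment side of that interaction.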
Submission Number: 15