Keywords: Large Language Models, Olympiad Programming, Reinforcement Learning, Online Judge, Code Generation
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in code generation, achieving near-human performance on standard programming tasks. However, these models still struggle with the complex algorithmic problems found in competitive programming environments such as the International Olympiad in Informatics (IOI), where success requires sophisticated mathematical reasoning and algorithmic optimization. To address this challenge, we introduce Reinforcement Learning with Online Judge Feedback (RLOJF), a novel framework that enables LLMs to learn from rapid execution feedback through a high-performance distributed evaluation system. RLOJF combines supervised fine-tuning (SFT) with proximal policy optimization (PPO), using a hierarchical reward mechanism that balances code correctness, efficiency, and quality. Our framework advances the state of the art through a two-phase training strategy that establishes strong baseline capabilities before optimizing against feedback, a reward design that prevents policy collapse while encouraging code improvement, and a distributed evaluation architecture that reduces feedback latency from minutes to seconds. We evaluate RLOJF on a dataset of 1,280 competitive programming problems and observe a significant improvement in solution quality, with the average pass@1 rate increasing from 48% to 81%. Comprehensive ablation studies show the complementary benefits of SFT and PPO: SFT excels at code structure and documentation, while PPO significantly improves runtime accuracy and execution success rates. These results suggest promising directions for applying reinforcement learning to complex algorithmic programming tasks.
Submission Number: 13
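The submission page gives no implementation details, so the following is a minimal, hypothetical sketch of the hierarchical reward described in the abstract. All names, weights, and the compile penalty are illustrative assumptions rather than the authors' implementation; the sketch simply assumes correctness dominates the reward, with efficiency and code quality as smaller terms and a fixed penalty for non-compiling submissions.

```python
from dataclasses import dataclass

@dataclass
class JudgeResult:
    """Outcome of running a candidate solution on the online judge (hypothetical fields)."""
    tests_passed: int     # number of test cases passed
    tests_total: int      # total number of test cases
    compiled: bool        # whether the submission compiled / parsed
    runtime_ratio: float  # solution runtime divided by the time limit
    quality_score: float  # heuristic style/structure score in [0, 1]

def hierarchical_reward(result: JudgeResult,
                        w_correct: float = 0.7,
                        w_efficiency: float = 0.2,
                        w_quality: float = 0.1,
                        compile_penalty: float = -1.0) -> float:
    """Combine correctness, efficiency, and quality into a scalar reward.

    Correctness carries most of the weight; efficiency and quality act as
    tie-breakers, so the policy is not rewarded for degenerate outputs that
    merely compile. The weights here are assumptions, not the paper's values.
    """
    if not result.compiled:
        return compile_penalty

    correctness = result.tests_passed / max(result.tests_total, 1)
    # Reward faster solutions: 1.0 at zero runtime, 0.0 at or beyond the time limit.
    efficiency = max(0.0, 1.0 - min(result.runtime_ratio, 1.0))
    return (w_correct * correctness
            + w_efficiency * efficiency
            + w_quality * result.quality_score)

# Example: a solution passing 9/10 tests at half the time limit with decent style.
print(hierarchical_reward(JudgeResult(9, 10, True, 0.5, 0.8)))  # -> 0.81
```

Under these assumed weights a fully correct solution always outranks a partially correct one regardless of speed or style, which is one simple way to realize the correctness/efficiency/quality balance the abstract describes.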