[AML] A Reinforcement Learning Technique for Large Language Model on Self-Verifiable Problems

THU 2024 Winter AML Submission 21 Authors

11 Dec 2024 (modified: 02 Mar 2025) · THU 2024 Winter AML Submission · Everyone · Revisions · BibTeX · CC BY 4.0
Keywords: Reinforcement Learning; Large Language Models; Self-Verifiable Problems
TL;DR: We propose a reinforcement learning-based approach to enhance the ability of LLMs to solve problems that allow for self-verification.
Abstract: For certain classes of self-verifiable problems, such as those found in programming competitions, game theory, and mathematics, an intriguing question arises: can a large language model, guided solely by iterative attempts and performance feedback rather than human prior knowledge, incrementally refine its solutions until it surpasses human-level problem-solving ability? By continuously recording both successful and unsuccessful attempts and using these histories as reinforcement signals, is it possible to train a model, through iterative refinement and reinforcement learning, to reach expertise well beyond that of human practitioners? Building on this idea, we use the Codegeex4-9B model as our base large language model and apply a reinforcement learning framework to the self-verifiable domain of programming challenges, such as those in NOI/ACM competitions. Our preliminary experiments show that a feedback-driven problem-solving strategy improves solution success rates by approximately 5–10 percentage points over random trial attempts. We then further enhance the model's capabilities through Direct Preference Optimization (DPO)-based reinforcement learning on the recorded solution histories. Although time constraints have limited the amount of data collected so far, we plan to release additional results in the coming days as we continue to refine and evaluate our system.
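The feedback-driven loop and the pairing of recorded attempts into DPO training data described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `propose` and `verify` callables stand in for the LLM sampler and the self-verification step (e.g. a judge running a problem's test cases), and `dpo_pairs` shows one plausible way to turn the attempt history into (prompt, chosen, rejected) preference triples.

```python
# Hypothetical sketch of the feedback-driven refinement loop from the abstract.
# `propose` stands in for the LLM (here it receives the problem plus prior
# failed attempts as context); `verify` stands in for self-verification,
# e.g. compiling and running a submission against hidden test cases.

from dataclasses import dataclass, field
from typing import Callable, List, Tuple

@dataclass
class AttemptLog:
    """Record of successful and unsuccessful attempts for one problem."""
    problem: str
    successes: List[str] = field(default_factory=list)
    failures: List[str] = field(default_factory=list)

def refine_with_feedback(
    problem: str,
    propose: Callable[[str, List[str]], str],  # assumed model interface
    verify: Callable[[str, str], bool],        # assumed verifier interface
    max_attempts: int = 8,
) -> Tuple[bool, AttemptLog]:
    """Iteratively attempt a problem, feeding failures back as context."""
    log = AttemptLog(problem)
    for _ in range(max_attempts):
        candidate = propose(problem, log.failures)
        if verify(problem, candidate):
            log.successes.append(candidate)
            return True, log
        log.failures.append(candidate)
    return False, log

def dpo_pairs(log: AttemptLog) -> List[Tuple[str, str, str]]:
    """Pair each passing attempt (chosen) with each failing one (rejected)
    to form (prompt, chosen, rejected) triples for DPO training."""
    return [(log.problem, win, lose)
            for win in log.successes for lose in log.failures]
```

The key design point mirrored here is that the history itself is the reinforcement signal: failed attempts both condition the next proposal and later serve as the rejected side of DPO preference pairs.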
Submission Number: 21
