Review of Reinforcement Learning for Large Language Models: Formulations, Algorithms, and Opportunities
Abstract: Large Language Models (LLMs) represent a significant milestone in the development of artificial intelligence. While pre-training on vast text corpora and subsequent supervised fine-tuning establish their core abilities, Reinforcement Learning (RL) has emerged as an indispensable paradigm for refining LLMs, particularly for aligning them with human values and teaching them to reason and follow complex instructions. As this field evolves rapidly, this survey offers a systematic review of RL methods for LLMs, focusing on fundamental concepts, formal problem settings, and the main algorithms adapted to this context. Our review critically examines the inherent computational and algorithmic challenges that arise from integrating RL with LLMs, such as scalability, effective gradient estimation, and training efficiency. Concurrently, we highlight promising opportunities for advancing LLM capabilities through new RL strategies, including multi-modal integration and the development of agentic LLM systems.
Submission Length: Long submission (more than 12 pages of main content)
Changes Since Last Submission:
- Added Empirical Performance Evidence (Section 3.1, Page 17): Introduced a new Figure 7 and quantitative data comparing algorithms, showing that REINFORCE-style methods achieve performance comparable to PPO.
- Reframed PPO Complexity Discussion (Section 3.1, Page 16): Revised the "From PPO to REINFORCE" paragraph to focus on PPO's computational trade-offs, removing language that could be misconstrued as dismissing its theoretical foundation.
- Expanded Process Rewards Coverage (Section 2.2, Page 10): Added a new dedicated subsection, "Process Reward Models: A Comparative Perspective," to provide a comparative analysis of process vs. outcome supervision.
- Added Broader Impact Statement (Section 7, Page 29): Created a new dedicated section addressing safety risks, capability trade-offs, and resource costs, complementing related discussions in Section 2.3.
- Enhanced Offline Algorithms Coverage (Section 3.2, Page 22): Expanded the section to include IPO and KTO, and added a new practical guidance subsection, "When to Prefer Offline vs. Online Methods."
- Quantified Systems Discussion (Section 4.3, Pages 25-26): Enhanced the "Training-Inference Mismatch" part with specific performance numbers and added a new subsection, "Reward Model MLOps and Reward Serving."
- Specified Future Directions (Section 5, Pages 28-29): Substantially expanded the section with concrete technical challenges and benchmarks for multi-modal RL and tool usage.
- Addressed Reviewer Gd7K's Feedback (Figure 6 & Figure 8): Clarified the meaning of "no epistemic uncertainty" in the Figure 6 caption and fixed the "proabability" typo in Figure 8.
Assigned Action Editor: ~Kamil_Ciosek1
Submission Number: 6043