FOE-RL: Flexible Online Reinforcement Learning for Efficient Inference in Large Language Models

13 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Reinforcement Learning; Efficient Reasoning; LLMs
Abstract: Recent advancements in large reasoning models have significantly enhanced their reasoning abilities. However, studies have shown that these models often exhibit "overthinking," even when handling relatively simple questions. In this paper, we propose a flexible online reinforcement learning method that estimates the difficulty of a problem in real time and predicts an appropriate output length. Based on this, we design a length reward function and a flexible reward trend monitor, which dynamically activates or deactivates the length reward according to smoothed correctness rewards. Experimental results demonstrate the effectiveness of our approach. Compared to training methods that rely solely on correctness rewards, our approach significantly improves model accuracy while substantially reducing the average response length. On the MATH dataset, our method reduces the output token count by over 40% and increases accuracy by more than 4%. Across multiple testing benchmarks, it maintains or even enhances model performance while consistently lowering token usage. Furthermore, we observe that the method exhibits a self-regulating output length capability: depending on the model's own capacity and question difficulty, it automatically converges toward an optimal output length range, achieving higher accuracy in the process.
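To make the gating mechanism described in the abstract concrete, the sketch below shows one plausible reading of the reward trend monitor: the length reward is combined with the correctness reward only while an exponential moving average of correctness stays high. This is a minimal illustration under assumed details; the smoothing factor `alpha`, the thresholds `tau_on`/`tau_off`, the `length_scale` coefficient, and the absolute-deviation form of the length reward are all hypothetical and not specified in the paper page.

```python
def combined_reward(correct, length, target_length,
                    ema_correct, gate_on,
                    alpha=0.9, tau_on=0.7, tau_off=0.5,
                    length_scale=0.001):
    """One reward step for a single sampled response (illustrative only).

    correct       : 1.0 if the answer is correct, else 0.0
    length        : number of generated tokens
    target_length : predicted appropriate output length for this problem
    ema_correct   : running exponential moving average of correctness
    gate_on       : whether the length reward is currently active
    """
    # Smooth the correctness reward with an exponential moving average.
    ema_correct = alpha * ema_correct + (1.0 - alpha) * correct

    # Hysteresis-style trend monitor (assumed): enable the length reward
    # while smoothed correctness is high, disable it if correctness drops.
    if ema_correct >= tau_on:
        gate_on = True
    elif ema_correct <= tau_off:
        gate_on = False

    # Length reward (assumed form): penalize deviation from the
    # predicted target length, but only while the gate is active.
    length_reward = -length_scale * abs(length - target_length) if gate_on else 0.0

    return correct + length_reward, ema_correct, gate_on
```

In this reading, the gate keeps length pressure off while the policy is still learning to answer correctly, and switches it on only once the smoothed correctness signal indicates the model can afford to compress its reasoning.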
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 4629