Keywords: Large Language Model, Reinforcement Learning, Reasoning
Abstract: Recent progress in large language models (LLMs) is largely driven by scaling training compute, through either pre-training with next-token prediction (NTP) or post-training with reinforcement learning (RL). The former enables learning broad knowledge and skills from general data, but suffers from data inefficiency and catastrophic forgetting in continual learning settings. The latter incentivizes reasoning capabilities with strong generalization, but is constrained by limited data availability due to its reliance on human annotation. To alleviate these issues, we propose Reinforcement Learning on Pre-Training data (RLPT), which combines the advantages of learning from general data with those of RL. In particular, RLPT derives reward signals directly from general text data through a next-segment reasoning objective, rewarding the policy for correctly predicting the next text segment conditioned on the preceding text. Experiments across multiple benchmarks and models demonstrate the effectiveness of RLPT. For example, RLPT yields substantial improvements in continual pre-training ($+4.6\%$) and provides a strong foundation for post-training ($+3.4\%$) on Qwen3-8B-Base.
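The abstract's next-segment reasoning objective can be sketched as follows. The abstract does not specify how a predicted segment is scored against the corpus continuation, so this minimal sketch assumes a simple token-overlap F1 as the reward; the function names (`token_f1`, `next_segment_reward`) are illustrative, not from the paper.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted and a reference segment.

    A stand-in similarity measure; RLPT's actual reward may differ.
    """
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Count overlapping tokens with multiplicity.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def next_segment_reward(policy_segment: str, corpus_segment: str) -> float:
    """Reward the policy for matching the corpus's actual next segment.

    The policy generates `policy_segment` conditioned on a text prefix;
    `corpus_segment` is the ground-truth continuation from general data,
    so no human annotation is required to compute the reward.
    """
    return token_f1(policy_segment, corpus_segment)
```

Because the reference continuation comes directly from the pre-training corpus, this kind of reward can be computed at scale without human labels, which is the key property the abstract highlights.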
Paper Type: Long
Research Area: Machine Learning for NLP
Research Area Keywords: reinforcement learning
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 5414