Keywords: Large Language Model, Reinforcement Learning, Reasoning
Abstract: Recent progress in large language models (LLMs) is largely driven by scaling training compute, through either pre-training with next-token prediction (NTP) or post-training with reinforcement learning (RL). The former enables learning broad knowledge and skills from general data, but suffers from data inefficiency and catastrophic forgetting in continual learning settings. The latter incentivizes reasoning capabilities with strong generalization, but is constrained by limited data availability due to its reliance on human annotation. To alleviate these issues, we propose Reinforcement Learning on Pre-Training data (RLPT), which combines the advantages of learning from general data with those of RL. In particular, RLPT derives reward signals directly from general text data through a next-segment reasoning objective, rewarding the policy for correctly predicting the next text segment conditioned on the preceding text. Experiments across multiple benchmarks and models demonstrate the effectiveness of RLPT. For example, RLPT yields substantial improvements in continual pre-training ($+4.6\%$) and provides a strong foundation for post-training ($+3.4\%$) on Qwen3-8B-Base.
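The abstract's next-segment reasoning objective can be sketched as follows. The abstract does not specify how a predicted segment is scored against the corpus continuation, so this minimal sketch assumes a simple token-overlap F1 as the reward; the function names (`token_f1`, `next_segment_reward`) are illustrative, not from the paper.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted and a reference segment.

    A stand-in similarity measure; RLPT's actual reward may differ.
    """
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Count overlapping tokens with multiplicity.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def next_segment_reward(policy_segment: str, corpus_segment: str) -> float:
    """Reward the policy for matching the corpus's actual next segment.

    The policy generates `policy_segment` conditioned on a text prefix;
    `corpus_segment` is the ground-truth continuation from general data,
    so no human annotation is required to compute the reward.
    """
    return token_f1(policy_segment, corpus_segment)
```

Because the reference continuation comes directly from the pre-training corpus, this kind of reward can be computed at scale without human labels, which is the key property the abstract highlights.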
Paper Type: Long
Research Area: Machine Learning for NLP
Research Area Keywords: reinforcement learning
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 5414