Go4RL: Improving the Pre-training Data Mixture of Large Language Models for Enhancing Reinforcement Learning

17 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Data mixture, Reinforcement learning, Large language models
TL;DR: We propose Go4RL, which investigates how pre-training dataset mixture strategies relate to producing a viable base model for RL training.
Abstract: The development of reinforcement learning (RL)-trained large language models (LLMs) such as OpenAI-o1 and DeepSeek-R1 has demonstrated notable advances in solving complex logical tasks involving mathematics and programming. Existing studies, however, show that the reasoning capability of RL-trained LLMs varies across base model families, raising a critical research question: which base model is best suited to having its reasoning capabilities augmented by RL? Recent prevailing research shows that 1) the reasoning capacity of an RL-trained model remains bounded by that of its base model and 2) the data mixture used for pre-training has a significant impact on the base model's performance. Building on these insights, we propose Go4RL, which addresses this question by investigating how pre-training dataset mixture strategies relate to a viable base model for RL training. Go4RL first measures a base model's reasoning capability boundary as the average perplexity of the base LLM over the $k$ RL-generated responses, denoted avg($k$-ppl). We then formulate finding the optimal pre-training dataset mixing ratios for RL as a regression task, using the proposed avg($k$-ppl) as the fitting objective in place of the traditional language modeling loss. Finally, we use the learned regression model to predict the performance of unseen mixtures and apply the best predicted mixture to train a large-scale model that achieves better RL performance. To validate Go4RL, we train both pre-trained and RL-trained proxy models with 1M parameters for regression fitting and then scale up to 1B parameters using various data mixtures. Experimental results on both online and offline RL algorithms show that the optimized data mixture predicted by Go4RL yields a better base model for RL training.
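For concreteness, the sketch below is one hedged reading of the two core steps the abstract describes: computing avg($k$-ppl) for a base model over RL-generated responses, and regressing that score against pre-training mixture ratios to select a promising unseen mixture. The Hugging Face/scikit-learn APIs, the linear regression family, and all numeric values are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of the Go4RL pipeline as described in the abstract, assuming
# Hugging Face Transformers and scikit-learn. The regression family and the
# toy mixture ratios/scores below are illustrative assumptions.
import math
import numpy as np
import torch
from sklearn.linear_model import LinearRegression
from transformers import AutoModelForCausalLM, AutoTokenizer


def avg_k_ppl(base_model_name: str, rl_responses: list[str]) -> float:
    """avg(k-ppl): average perplexity of a base LM over k RL-generated responses."""
    tok = AutoTokenizer.from_pretrained(base_model_name)
    model = AutoModelForCausalLM.from_pretrained(base_model_name).eval()
    ppls = []
    with torch.no_grad():
        for text in rl_responses:
            ids = tok(text, return_tensors="pt").input_ids
            # Causal-LM loss is the mean per-token NLL; perplexity = exp(loss).
            loss = model(input_ids=ids, labels=ids).loss
            ppls.append(math.exp(loss.item()))
    return sum(ppls) / len(ppls)


# Regression step: fit avg(k-ppl) as a function of the proxy models'
# pre-training mixture ratios, then pick the unseen candidate mixture with
# the lowest predicted score. Rows are proxy runs; columns are ratios over
# data domains (each row sums to 1). Values here are toy placeholders.
mixtures = np.array([[0.5, 0.3, 0.2],
                     [0.2, 0.5, 0.3],
                     [0.3, 0.2, 0.5]])
scores = np.array([12.4, 10.1, 11.3])  # toy avg(k-ppl) per proxy model

reg = LinearRegression().fit(mixtures, scores)
candidates = np.random.dirichlet(np.ones(3), size=1000)  # unseen mixtures
best_mixture = candidates[reg.predict(candidates).argmin()]
print("predicted-best mixture:", best_mixture)
```

Lower avg($k$-ppl) is taken here as the better score, matching the abstract's framing of perplexity as a boundary on the base model's reasoning capability; the predicted-best mixture would then be used to pre-train the large-scale model before RL.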
Supplementary Material: zip
Primary Area: other topics in machine learning (i.e., none of the above)
Submission Number: 8804