Keywords: pre-training, annealing, token-level reweighting, synergy between pre-training and post-training
TL;DR: RL-Guided Annealing (RGA) reweights tokens via an RL reference during annealing, boosting pre/post-training without extra teachers.
Abstract: Training large language models (LLMs) typically proceeds in two distinct stages: pre-training and post-training. However, the question of how to exploit these stages synergistically—particularly how post-trained models can inform and improve pre-training—remains underexplored.
We begin by analyzing training dynamics and identify the annealing (mid-training) phase as a critical turning point for the pre-trained base model’s capabilities. During this stage, high-quality corpora are introduced under a rapidly decaying learning rate, leading to a substantial shift in the base model’s probability distribution and a noticeable surge in performance.
Interestingly, while reinforcement learning (RL) during post-training induces only minor distributional shifts, it significantly enhances reasoning capabilities.
Motivated by this observation, we propose RL-Guided Annealing (RGA), a method that leverages RL-enhanced models, which are naturally produced by the standard LLM training pipeline, to guide token weighting during the annealing phase.
Specifically, RGA transfers knowledge from the RL stage back to annealing by reassigning token-level importance weights based on the per-token loss differences between the base and RL models.
Notably, RGA does not require any specially trained teacher or reference model.
Across multiple model families, RGA consistently improves performance, achieving average gains of 5.21%, 1.84%, and 1.78% on 10 pre-training benchmarks. It also boosts downstream performance after post-training by over 2%.
These findings reveal a positive feedback loop between pre-training and post-training: RL-tuned models retroactively improve their foundational base models, which in turn support more effective RL—enabling a self-reinforcing path toward higher model quality.
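The abstract describes reassigning token-level importance weights from per-token loss differences between the base and RL models. A minimal sketch of that idea follows; the function name, the softmax-style normalization, and the mean-one rescaling are illustrative assumptions, since the abstract does not specify the exact weighting scheme.

```python
import math

def rga_token_weights(base_losses, rl_losses, temperature=1.0):
    """Hypothetical sketch of RGA-style token reweighting.

    Tokens on which the RL model achieves lower loss than the base
    model (positive delta) are up-weighted during annealing; the
    softmax normalization and mean-one rescaling are assumptions.
    """
    # Per-token loss difference: positive where the RL model improves.
    deltas = [(b - r) / temperature for b, r in zip(base_losses, rl_losses)]
    # Softmax over the sequence (shift by the max for numerical stability).
    m = max(deltas)
    exps = [math.exp(d - m) for d in deltas]
    z = sum(exps)
    n = len(deltas)
    # Rescale so the weights average to 1, preserving the loss magnitude.
    return [n * e / z for e in exps]

# Toy example: three tokens; the RL model is better on the first,
# equal on the second, and worse on the third.
base = [2.0, 1.5, 3.0]
rl = [1.0, 1.5, 3.5]
weights = rga_token_weights(base, rl)
```

In an annealing loop, such weights would multiply the per-token cross-entropy terms before averaging, emphasizing tokens where RL conferred the largest gains.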
Primary Area: foundation or frontier models, including LLMs
Submission Number: 5418