MetroRLHF: Enabling Memory-Effective Training for On-Policy RLHF via Adaptive Sequence Streaming

Published: 16 Oct 2025, Last Modified: 10 Nov 2025
NeurIPS 2025 ER Workshop Spotlight
License: CC BY 4.0
Keywords: RLHF, LLM, post-training
Abstract: Reinforcement learning from human feedback (RLHF) has become the standard post‑training technique for endowing large language models (LLMs) with helpful, harmless, and intent‑consistent behavior. In practice, however, its adoption is hampered by prohibitive memory consumption during the policy‑model update phase, especially when training on long‑form generation tasks. In this paper, we propose MetroRLHF, a memory‑efficient, on‑policy RLHF approach that exploits inference-time computations to reduce the training-time memory budget and to skip unnecessary work. By re‑using the $K, V$ context materialized during the inference phase, MetroRLHF removes, at no extra cost, the inter‑token dependencies that normally force the entire sequence to be trained in parallel. Building on fine‑grained subsequence streaming, it trains only the productive tokens. The result is a training pipeline that matches the exact behavior of conventional full‑sequence RLHF while using less memory and incurring no arithmetic recomputation. Experiments on Qwen‑3 models demonstrate that MetroRLHF's rescheduled algorithm reduces peak training memory usage by factors of 3.8× to 5.9×, enabling memory‑efficient and semantically reliable fine‑tuning of LLMs.
Submission Number: 22
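The abstract describes computing per-token quantities for a long rollout chunk by chunk against a cached prefix $K, V$ context, rather than materializing activations for the whole sequence at once. The following is a minimal PyTorch sketch of that streaming pattern only; the model (`TinyCausalLM`), the helper `streamed_token_logprobs`, the chunk size, and the choice to treat the cache as a constant are illustrative assumptions, not the authors' implementation, and the sketch does not show how MetroRLHF preserves exact equivalence to full‑sequence RLHF.

```python
# Minimal sketch (assumed, not the authors' code) of chunked "sequence streaming":
# per-token log-probs of a sampled response are computed chunk by chunk, re-using
# the running K/V cache as the prefix context, so only one chunk's activations
# are live at a time.

import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyCausalLM(nn.Module):
    """Single-layer causal self-attention LM, just enough to show K/V re-use."""

    def __init__(self, vocab=256, d=64, n_heads=4):
        super().__init__()
        self.d, self.h = d, n_heads
        self.emb = nn.Embedding(vocab, d)
        self.qkv = nn.Linear(d, 3 * d)
        self.out = nn.Linear(d, vocab)

    def forward(self, ids, past_kv=None):
        # ids: (B, T_chunk); past_kv: optional (k, v), each (B, H, T_past, d_head)
        x = self.emb(ids)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        split = lambda t: t.view(*t.shape[:2], self.h, -1).transpose(1, 2)
        q, k, v = split(q), split(k), split(v)
        if past_kv is not None:
            k = torch.cat([past_kv[0], k], dim=2)  # prepend cached prefix keys
            v = torch.cat([past_kv[1], v], dim=2)  # prepend cached prefix values
        # Queries are only the current chunk; keys/values cover prefix + chunk.
        attn = F.scaled_dot_product_attention(q, k, v, attn_mask=self._mask(q, k))
        y = attn.transpose(1, 2).reshape(*ids.shape, self.d)
        return self.out(y), (k, v)

    @staticmethod
    def _mask(q, k):
        # Each chunk token attends to the whole cached prefix and to
        # non-future tokens within its own chunk.
        t_q, t_k = q.shape[2], k.shape[2]
        offset = t_k - t_q
        i = torch.arange(t_q, device=q.device).unsqueeze(1)
        j = torch.arange(t_k, device=q.device).unsqueeze(0)
        return j <= i + offset


def streamed_token_logprobs(model, response_ids, chunk_len):
    """Per-token log-probs of a rollout, computed chunk by chunk against the
    running K/V cache instead of re-running the full sequence in parallel."""
    logps, past = [], None
    for start in range(0, response_ids.shape[1] - 1, chunk_len):
        chunk = response_ids[:, start:start + chunk_len]
        targets = response_ids[:, start + 1:start + chunk_len + 1]
        logits, past = model(chunk, past_kv=past)
        logits = logits[:, : targets.shape[1]]
        lp = torch.log_softmax(logits, dim=-1)
        logps.append(lp.gather(-1, targets.unsqueeze(-1)).squeeze(-1))
        # Assumption: the cache is treated as a constant, so gradients only
        # flow through the current chunk's activations.
        past = (past[0].detach(), past[1].detach())
    return torch.cat(logps, dim=1)


if __name__ == "__main__":
    torch.manual_seed(0)
    model = TinyCausalLM()
    rollout = torch.randint(0, 256, (2, 33))            # toy 33-token rollout
    token_logps = streamed_token_logprobs(model, rollout, chunk_len=8)
    print(token_logps.shape)                             # torch.Size([2, 32])
```

In this sketch, peak activation memory scales with the chunk length rather than the full response length, which is the memory behavior the abstract attributes to subsequence streaming; the paper's claimed exactness with respect to full‑sequence training is not reproduced here.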