Aligning Large Language Models (LLMs) to human preferences is essential for their effective deployment in real-world applications. Traditional post-training methods, such as Reinforcement Learning from Human Feedback (RLHF), are resource-intensive and time-consuming, especially as model sizes continue to grow. Recently, inference-time alignment methods have gained significant attention because they can steer the LLM output without direct fine-tuning and can be integrated with post-training techniques to further enhance performance. Additionally, these methods enable personalization, allowing models to adapt dynamically to user preferences and specific task requirements. However, these approaches operate in a one-shot manner, limiting policy improvement to a single round. To address this limitation, we introduce inference-time Successive Policy Iterations (SPI), a novel algorithm that enables successive policy improvement at inference time. Specifically, inference-time SPI iteratively learns value functions and leverages them to guide the LLM through a search-based optimization process. Theoretically, our algorithm is equivalent to performing multi-iteration policy optimization on the base model, effectively improving its behavior without direct fine-tuning. Experimental results demonstrate that inference-time SPI significantly improves length-controlled win rates on challenging instruction-following benchmarks such as AlpacaEval 2.0, achieving a substantial performance boost (e.g., $30.71\% \to 43.80\%$ for \texttt{Llama-3-8B-Instruct} compared against GPT-4 responses). Furthermore, inference-time SPI consistently outperforms existing test-time alignment baselines such as Best-of-N (BoN) and weak-to-strong search, demonstrating its effectiveness for inference-time scaling across different tasks.
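The abstract only sketches the algorithm at a high level; as a rough illustration of the general idea (not the authors' implementation), the toy Python below shows one way successive value-guided search could be wired together at inference time. All names in it (`fit_value_function`, `value_guided_policy`, the toy reward model) are hypothetical placeholders introduced for this sketch.

```python
# Minimal sketch of successive value-guided search at inference time.
# Assumption: the base LLM and the reward model are represented as simple
# callables; a real system would wrap actual model/scoring APIs.

import random
from typing import Callable, List


def generate_candidates(policy: Callable[[str], str], prompt: str, n: int) -> List[str]:
    """Sample n candidate responses from the current (guided) policy."""
    return [policy(prompt) for _ in range(n)]


def fit_value_function(candidates: List[str],
                       reward_fn: Callable[[str], float]) -> Callable[[str], float]:
    """Toy value estimate: score each sampled candidate with the reward model,
    then value a new response by the score of its nearest (token-overlap)
    neighbor. A real system would train a lightweight value model instead."""
    scored = [(set(c.split()), reward_fn(c)) for c in candidates]

    def value(response: str) -> float:
        tokens = set(response.split())
        return max(scored, key=lambda sr: len(sr[0] & tokens))[1]

    return value


def value_guided_policy(base_policy: Callable[[str], str],
                        value_fn: Callable[[str], float],
                        prompt: str, width: int = 4) -> str:
    """One search step: sample several responses from the base policy and
    keep the one the current value function prefers."""
    pool = [base_policy(prompt) for _ in range(width)]
    return max(pool, key=value_fn)


def successive_policy_iterations(base_policy: Callable[[str], str],
                                 reward_fn: Callable[[str], float],
                                 prompt: str,
                                 iterations: int = 3,
                                 n_samples: int = 8) -> str:
    """Iteratively refine the guided policy without touching model weights:
    each round fits a value function on samples from the current policy and
    uses it to guide the next round of search."""
    policy = base_policy
    for _ in range(iterations):
        candidates = generate_candidates(policy, prompt, n_samples)
        value_fn = fit_value_function(candidates, reward_fn)
        # The improved policy wraps the frozen base model with value-guided search.
        policy = lambda p, v=value_fn: value_guided_policy(base_policy, v, p)
    return policy(prompt)


if __name__ == "__main__":
    # Stand-ins for an LLM and a reward model, only to make the sketch runnable.
    toy_llm = lambda prompt: prompt + " -> response " + str(random.randint(0, 99))
    toy_reward = lambda response: float(len(response))
    print(successive_policy_iterations(toy_llm, toy_reward, "Explain RLHF briefly."))
```

The key design point the sketch tries to convey is that the base model is never fine-tuned: each iteration only changes how its samples are filtered, which is what makes the procedure equivalent in spirit to multi-round policy improvement performed purely at inference time.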