Aligning Frozen LLMs by Reinforcement Learning: An Iterative Reweight-then-Optimize Approach

ICLR 2026 Conference Submission14291 Authors

18 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Inference-time alignment; Reinforcement Learning; RLHF
Abstract: Aligning large language models (LLMs) with human preferences typically requires fine-tuning methods such as RLHF and DPO. These methods directly optimize the model parameters, so they cannot be used at inference time to improve model performance, nor are they applicable when the model weights are inaccessible. In contrast, test-time methods sidestep weight updates by leveraging reward functions to guide and improve output quality. However, they incur high inference costs, and their one-shot guidance is often based on imperfect reward or value functions, leading to suboptimal outputs. In this work, we present Iterative Reweight-then-Optimize (IRO), a reinforcement learning (RL) framework that performs RL-style alignment of a frozen base model without touching its parameters. During training, each iteration (i) samples candidates from the base model, (ii) resamples them using the current value functions, and (iii) trains a new lightweight value function that guides the next decoding pass. At inference time, the value functions are used to guide the base model's generation via a search-based optimization process. We prove that, under mild conditions, IRO is a form of policy iteration and attains the performance of Best-of-N (BoN) search with exponentially fewer tokens at inference time. Experimental results demonstrate that IRO significantly improves length-controlled win rates on challenging instruction-following benchmarks such as AlpacaEval 2.0, achieving substantial gains (e.g., $30.71\% \to 43.80\%$ for \texttt{Llama-3-8B-Instruct} and $43.11\% \to 49.77\%$ for \texttt{Llama-3-70B-Instruct}, compared against GPT-4 responses). Furthermore, IRO consistently outperforms state-of-the-art inference-time alignment baselines such as BoN and weak-to-strong search, even when using much smaller value functions (of size 1B or 7B) to guide a large base model (of size 6.9B or 70B).
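
The abstract describes a three-step training loop plus a value-guided decoding stage. The Python sketch below is only an illustration of that loop under stated assumptions: the softmax (exponential-tilting) reweighting and the toy value-function fit are assumed concrete choices, and all names (`base_model`, `value_fn`, `reward_fn`, `sample_candidates`, etc.) are illustrative placeholders, not the authors' implementation.

```python
import math
import random

# Minimal sketch of one IRO iteration as described in the abstract.
# `base_model`, `value_fn`, and `reward_fn` are stand-in callables; the
# exponential reweighting and lookup-table "value function" are assumptions
# made purely for illustration.

def sample_candidates(base_model, prompt, n=8):
    # (i) draw n candidate responses from the frozen base model
    return [base_model(prompt) for _ in range(n)]

def resample_by_value(candidates, value_fn, beta=1.0):
    # (ii) reweight candidates by softmax(value / beta) and resample;
    # the exponential tilting here is an assumed concrete choice.
    scores = [value_fn(c) for c in candidates]
    m = max(scores)
    weights = [math.exp((s - m) / beta) for s in scores]
    return random.choices(candidates, weights=weights, k=len(candidates))

def train_value_function(resampled, reward_fn):
    # (iii) fit a new lightweight value function on the resampled data;
    # a reward lookup table serves as a toy placeholder for that fit.
    table = {c: reward_fn(c) for c in resampled}
    return lambda c: table.get(c, 0.0)

def iro_iteration(base_model, prompt, value_fn, reward_fn):
    # One training iteration; at inference time the learned value functions
    # would instead guide a search-based decoding pass over the frozen model.
    candidates = sample_candidates(base_model, prompt)
    resampled = resample_by_value(candidates, value_fn)
    return train_value_function(resampled, reward_fn)
```
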
Primary Area: foundation or frontier models, including LLMs
Submission Number: 14291