Group-Normalized Implicit Value Optimization for Language Models

Published: 26 Jan 2026, Last Modified: 11 Feb 2026 · ICLR 2026 Poster · CC BY 4.0
Keywords: LLM post-training
Abstract: Fine-tuning Large Language Models (LLMs) with reinforcement learning (RL) has become a key technique for enhancing performance on a wide range of tasks, from user alignment to complex reasoning. However, this approach is often hindered by the difficulty of fine-grained credit assignment, as it typically relies on sparse rewards given only at the end of a fully generated sequence. Conventional solutions often require training an auxiliary value network, known as a critic, which introduces significant computational overhead and training instability. We present Group-Normalized Implicit Value Optimization (GN-IVO), a novel, critic-free algorithm that directly addresses this challenge. GN-IVO learns step-level values implicitly from the policy through a group-normalized distributional matching objective. This approach circumvents the need for an explicit critic and avoids computing the intractable partition function by normalizing values across a group of sampled model responses. Theoretically, we prove that our objective recovers the true value function up to a constant, guaranteeing that the optimal policy is preserved. We demonstrate the practical effectiveness of GN-IVO on a diverse set of text generation and reasoning tasks, showing that it consistently outperforms strong RL baselines for LLMs.
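To make the group-normalization idea in the abstract concrete, the sketch below illustrates one way step-level values derived implicitly from the policy could be normalized across a group of sampled responses, so that any shared constant (such as an intractable log-partition term) cancels without an explicit critic. This is a hypothetical illustration only, not the paper's actual objective: the use of cumulative per-step log-probabilities as the implicit value proxy, the function names, and the tensor shapes are all assumptions.

import torch

def group_normalized_values(step_logps: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Hypothetical sketch of group normalization of implicit step values.

    step_logps: [G, T] per-step log-probabilities of the sampled tokens under
                the current policy for G responses to the same prompt
                (used here as a stand-in for implicit step values; assumption).
    returns:    [G, T] group-normalized step values.
    """
    # Cumulative log-probability up to each step serves as a simple implicit
    # value proxy in this sketch.
    implicit_values = step_logps.cumsum(dim=-1)

    # Normalize across the group dimension: subtracting the group mean removes
    # any additive constant shared by all responses (e.g., a log-partition
    # term), and dividing by the group std stabilizes the scale of the signal.
    mean = implicit_values.mean(dim=0, keepdim=True)
    std = implicit_values.std(dim=0, keepdim=True)
    return (implicit_values - mean) / (std + eps)

# Usage: G=4 sampled responses of length T=8 to one prompt.
logps = torch.rand(4, 8).log()  # placeholder per-step log-probabilities
step_values = group_normalized_values(logps)
# The normalized step values could then weight a per-step policy-gradient loss.

Because the normalization is performed within a group of responses to the same prompt, no separate value network has to be trained or stored, which is the computational saving the abstract attributes to the critic-free design.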
Supplementary Material: pdf
Primary Area: optimization
Submission Number: 11087