GIFT: Group-relative Implicit Fine Tuning Integrates GRPO, DPO and UNA

ACL ARR 2026 January Submission4352 Authors

05 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: LLM, Fine tuning, GRPO, DPO, UNA
Abstract: I propose \textbf{G}roup-relative \textbf{I}mplicit \textbf{F}ine \textbf{T}uning (GIFT), a novel reinforcement learning framework for aligning LLMs. Instead of directly maximizing cumulative rewards as PPO or GRPO does, GIFT minimizes the discrepancy between implicit and explicit reward models. It combines three key ideas: (1) the online multi-response generation and normalization of GRPO, (2) the implicit reward formulation of DPO, and (3) the implicit–explicit reward alignment principle of UNA. By jointly normalizing the implicit and explicit rewards, GIFT eliminates an otherwise intractable term that prevents effective use of implicit rewards. This normalization transforms the complex reward maximization objective into a simple mean squared error (MSE) loss between the normalized reward functions, converting a non-convex optimization problem into a convex, stable, and analytically differentiable formulation. Despite using a supervised-style MSE loss, GIFT remains a policy optimization method: it optimizes the policy under the same KL-regularized PPO-style objective as RLHF and GRPO, but replaces direct reward maximization with normalized reward matching. Unlike offline methods such as DPO and UNA, GIFT remains on-policy and thus retains exploration capability. Compared to GRPO, it requires fewer hyperparameters, converges faster, and generalizes better, with substantially less overfitting during training. Empirically, GIFT achieves superior reasoning and alignment performance on mathematical and knowledge benchmarks while remaining computationally efficient.
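The loss described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the DPO-style implicit reward $\beta(\log\pi - \log\pi_{\mathrm{ref}})$ per response, GRPO-style normalization across a group of G responses to the same prompt, and an MSE between the two normalized reward vectors; all function and variable names are hypothetical.

```python
import numpy as np

def gift_loss(logp_policy, logp_ref, rewards, beta=0.1, eps=1e-8):
    """Sketch of a GIFT-style loss for one prompt's group of G responses.

    logp_policy, logp_ref: log-probabilities of each sampled response under
    the current policy and the frozen reference model (shape: [G]).
    rewards: explicit reward-model scores for the same responses (shape: [G]).
    """
    # DPO-style implicit reward: beta * (log pi - log pi_ref)
    implicit = beta * (np.asarray(logp_policy) - np.asarray(logp_ref))
    explicit = np.asarray(rewards, dtype=float)

    # Group-relative normalization, as in GRPO's advantage computation:
    # subtract the group mean and divide by the group standard deviation.
    def normalize(r):
        return (r - r.mean()) / (r.std() + eps)

    # MSE between the normalized implicit and explicit rewards.
    diff = normalize(implicit) - normalize(explicit)
    return float(np.mean(diff ** 2))
```

Because the group normalization removes mean and scale, the loss is zero whenever the implicit rewards are an increasing affine transform of the explicit ones, which is consistent with the abstract's claim that normalization cancels the otherwise intractable (response-independent) term in the implicit reward.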
Paper Type: Long
Research Area: Language Models
Research Area Keywords: LLM, Fine tuning, GRPO, DPO, UNA
Contribution Types: Theory
Languages Studied: English
Submission Number: 4352