TL;DR: In this paper, we propose ConfPO, which identifies preference-critical tokens based on the training policy's confidence and thus requires no additional models or compute for token selection.
Abstract: We introduce ConfPO, a method for preference learning in Large Language Models (LLMs) that identifies and optimizes preference-critical tokens based solely on the training policy's confidence, without requiring any auxiliary models or extra compute. Unlike prior Direct Alignment Algorithms (DAAs) such as Direct Preference Optimization (DPO), which uniformly adjust all token probabilities regardless of their relevance to the preference signal, ConfPO focuses optimization on the most impactful tokens. This targeted approach improves alignment quality while mitigating overoptimization (i.e., reward hacking) by spending the KL-divergence budget more efficiently. In contrast to recent token-level methods that rely on credit-assignment models or AI annotators, which raises concerns about scalability and reliability, ConfPO is simple, lightweight, and model-free. Experimental results on challenging alignment benchmarks, including AlpacaEval 2 and Arena-Hard, demonstrate that ConfPO consistently outperforms uniform DAAs across various LLMs, delivering better alignment with zero additional computational overhead.
Lay Summary: Teaching large language models to generate responses that align with human preferences is a key challenge. Current training methods are often inefficient because they treat every word—from simple fillers like "the" and "a" to critical keywords—as equally important for learning. This wastes resources and can limit how much the AI's helpfulness improves.
We developed a method called ConfPO that intelligently focuses the learning process. ConfPO monitors the AI's own confidence as it generates a response; when the model is less confident about a particular word, that low confidence signals that the word is likely an important decision point. Our method then directs the training updates only to these few high-impact words.
This focused strategy makes the learning process more efficient, resulting in higher-quality, more helpful AI responses. Crucially, ConfPO achieves this performance boost without requiring any additional complex models or extra computational power, offering a simple and effective tool for building better language technologies.
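To make the idea concrete, below is a minimal sketch of how a confidence-gated, DPO-style objective could look. It is an illustration under stated assumptions, not the authors' exact formulation: the selection rule (keeping tokens whose policy log-probability falls below the per-sequence average), the function names `confpo_style_loss` and `token_logps`, and the temperature `beta` are all hypothetical choices made for clarity.

```python
# Illustrative sketch only: the low-confidence selection rule and all names
# below are assumptions, not the paper's exact method.
import torch
import torch.nn.functional as F


def token_logps(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Per-token log-probabilities of the realized labels.
    Shapes: logits (B, T, V), labels (B, T) -> returns (B, T)."""
    logps = F.log_softmax(logits, dim=-1)
    return torch.gather(logps, dim=-1, index=labels.unsqueeze(-1)).squeeze(-1)


def confpo_style_loss(policy_chosen_logits, policy_rejected_logits,
                      ref_chosen_logps, ref_rejected_logps,
                      chosen_labels, rejected_labels,
                      chosen_mask, rejected_mask, beta: float = 0.1):
    """DPO-like loss restricted to low-confidence tokens of the training policy.

    ref_*_logps are precomputed per-token log-probs from the frozen reference model;
    *_mask is 1 for response tokens and 0 for prompt/padding positions.
    """
    pi_c = token_logps(policy_chosen_logits, chosen_labels)      # (B, T)
    pi_r = token_logps(policy_rejected_logits, rejected_labels)  # (B, T)

    def select_low_confidence(logps, mask):
        # Assumed criterion: keep response tokens whose policy log-prob is
        # below the sequence-average confidence over response tokens.
        avg = (logps * mask).sum(-1, keepdim=True) / mask.sum(-1, keepdim=True).clamp(min=1)
        return ((logps < avg) & mask.bool()).float()

    sel_c = select_low_confidence(pi_c.detach(), chosen_mask)
    sel_r = select_low_confidence(pi_r.detach(), rejected_mask)

    # Accumulate policy/reference log-ratios only over the selected tokens,
    # so the update concentrates on preference-critical positions.
    chosen_logratio = ((pi_c - ref_chosen_logps) * sel_c).sum(-1)
    rejected_logratio = ((pi_r - ref_rejected_logps) * sel_r).sum(-1)

    # Standard Bradley-Terry / DPO objective on the restricted token sets.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```

Because the token selection reuses log-probabilities the policy already computes during the forward pass, this kind of gating adds no extra models or forward passes, which is the property the abstract emphasizes.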
Primary Area: Deep Learning->Large Language Models
Keywords: RLHF, DAA, overoptimization
Submission Number: 12507