TL;DR: We propose an iterative alignment framework that mitigates the impact of preference noise by effectively identifying and filtering noisy samples.
Abstract: Noise prevalent in preference data inevitably poses significant challenges to the preference alignment of large language models (LLMs). Existing efforts either only marginally alleviate the impact of noise without reducing it, or rely on external LLMs that incur substantial computational cost. To address these challenges, we propose **RO**bust **P**reference **O**ptimization (**ROPO**), an iterative alignment approach that integrates *noise tolerance* and *noise filtering* without the aid of external models. Specifically, ROPO first formulates training with adaptive noise reduction as an optimization problem that can be solved efficiently in an iterative paradigm. Then, to equip this solving process with noise-tolerance and noise-identification capabilities, we derive a robust loss that suppresses the gradients of samples with high uncertainty. We demonstrate both empirically and theoretically that this loss is key to noise tolerance and to the effective filtering of noisy samples. The derived loss further inspires a robustness-guided rejection sampling technique that compensates for potentially important information in discarded queries. Extensive experiments on several widely used datasets and model architectures show that ROPO significantly outperforms all baselines under **four** practical noise settings and under random symmetric noise, with its advantage growing as the noise rate increases.
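The abstract does not state the exact form of the derived loss, but the core idea, suppressing gradients from high-uncertainty samples, can be illustrated with a hypothetical sketch. The function names and the specific bounded loss below are illustrative assumptions, not the paper's actual ROPO objective: unlike the unbounded `-log(sigmoid(margin))` used in DPO-style training (whose gradient magnitude approaches 1 for very negative margins), a bounded sigmoid-shaped loss saturates, so pairs the model strongly disagrees with (often mislabeled ones) contribute vanishing gradients.

```python
import math

def sigmoid(x: float) -> float:
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))

def robust_preference_loss(margin: float) -> float:
    """Hypothetical bounded loss on the preference margin.

    margin = beta * (reward of chosen - reward of rejected), as in
    DPO-style objectives. The loss sigmoid(-margin) saturates as the
    margin becomes very negative, so likely-noisy samples stop
    contributing gradient, in contrast to -log(sigmoid(margin)),
    whose gradient magnitude approaches 1 in that regime.
    """
    return sigmoid(-margin)

def loss_gradient(margin: float) -> float:
    """Analytic gradient: d/dm sigmoid(-m) = -sigmoid(m) * (1 - sigmoid(m))."""
    s = sigmoid(margin)
    return -s * (1.0 - s)
```

A sample with a strongly negative margin (the model confidently disagrees with the label) yields `loss_gradient(-10.0) ≈ -4.5e-5`, far smaller in magnitude than `loss_gradient(0.0) = -0.25` for an uncertain sample, which is the gradient-suppression behavior the abstract attributes to the robust loss. The same per-sample weight could also serve as a score for ranking and filtering suspected noisy pairs.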
Lay Summary: Large language models (LLMs) often learn from human preference data, but this data is frequently noisy (containing mistakes or inconsistencies), which makes it hard to align models effectively. Existing solutions either ignore the noise or rely on expensive external models to clean it. Our research introduces ROPO, a new method that trains LLMs to handle noisy preference data efficiently and accurately, without using any external models. ROPO combines two key ideas: it actively filters out noisy data during training, and it reduces the influence of uncertain samples through a robust loss function. This loss helps the model focus on reliable signals, while a guided sampling strategy preserves potentially useful information from the filtered data. ROPO works iteratively, refining the model and the data quality at the same time. We prove that our approach is theoretically sound and show that it is practically effective: experiments demonstrate that ROPO consistently outperforms existing methods, especially when the noise is severe. This makes ROPO a valuable tool for training more reliable and efficient LLMs in real-world settings.
Primary Area: Deep Learning->Large Language Models
Keywords: noise tolerance, large language models, preference optimization
Submission Number: 11523