Scalable Valuation of Human Feedback through Provably Robust Model Alignment

Published: 18 Sept 2025, Last Modified: 29 Oct 2025. NeurIPS 2025 poster. License: CC BY 4.0
Keywords: Direct Preference Optimization, Robustness, Label Flip, Valuation of Human Feedback
TL;DR: We propose Hölder-DPO, the first alignment method with a provable redescending property, which enables robust learning from noisy human feedback by identifying and correcting mislabeled data, improving alignment and model performance.
Abstract: Despite the importance of aligning language models with human preferences, crowd-sourced human feedback is often noisy---for example, preferring less desirable responses---posing a fundamental challenge to alignment. A truly robust alignment objective should yield identical model parameters even under severe label noise, a property known as redescending. We prove that no existing alignment method satisfies this property. To address this, we propose Hölder-DPO, the first principled alignment loss with a provable redescending property, enabling estimation of the clean data distribution from noisy feedback. The aligned model estimates the likelihood of clean data, providing a theoretically grounded metric for dataset valuation that identifies the location and fraction of mislabels. This metric is gradient-free, enabling scalable and automated valuation of human feedback without costly manual verification or a clean validation dataset. Hölder-DPO achieves state-of-the-art robust alignment performance while accurately detecting mislabels in controlled datasets. Finally, applied to the Anthropic HH-RLHF dataset, it reveals substantial noise levels, and removing these mislabels significantly improves alignment performance across methods. The code is available at https://github.com/ma921/HolderDPO.
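For readers unfamiliar with the setup, the sketch below shows the standard DPO objective that Hölder-DPO builds on, together with a hypothetical per-example ranking to flag suspect preference labels. The function names and the valuation heuristic are illustrative assumptions only; they are not the paper's actual Hölder-DPO loss or its theoretically grounded valuation metric.

```python
# Minimal sketch: standard DPO loss plus a stand-in mislabel-ranking heuristic.
# The Hölder-DPO objective and its gradient-free valuation metric differ; see the paper/code.
import torch
import torch.nn.functional as F


def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO loss: -log sigmoid(beta * implicit reward margin), per example."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    margin = chosen_rewards - rejected_rewards
    return -F.logsigmoid(margin)  # shape: (batch,)


def valuation_scores(per_example_loss: torch.Tensor) -> torch.Tensor:
    """Hypothetical gradient-free valuation: rank examples by their loss.

    Pairs with the largest loss under the aligned model are the most
    'surprising' preferences and thus candidate mislabels. This ranking is
    only a stand-in to illustrate the workflow of dataset valuation.
    """
    return per_example_loss  # higher = more likely mislabeled


if __name__ == "__main__":
    torch.manual_seed(0)
    # Fake sequence log-likelihoods for a batch of 5 preference pairs.
    pol_w, pol_l = torch.randn(5), torch.randn(5)
    ref_w, ref_l = torch.randn(5), torch.randn(5)
    losses = dpo_loss(pol_w, pol_l, ref_w, ref_l)
    suspects = torch.argsort(valuation_scores(losses), descending=True)
    print("per-example DPO loss:", losses.tolist())
    print("indices ranked by suspicion of mislabeling:", suspects.tolist())
```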
Supplementary Material: zip
Primary Area: Social and economic aspects of machine learning (e.g., fairness, interpretability, human-AI interaction, privacy, safety, strategic behavior)
Submission Number: 8484