AlignDiff: Exploiting Model-Intrinsic Information for Better Preference Data Selection

19 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Data Filtering, Preference Alignment
Abstract: Aligning large language models with human preferences remains challenging, and the quality of preference data is critical for effective alignment. Existing large-scale datasets often introduce noise and distribution shifts, limiting model performance. To address this, we propose AlignDiff, a preference data filtering framework driven by intrinsic model signals. AlignDiff first identifies samples with clear preferences using both positive and inverse signals, then prioritizes the more challenging samples based on the gap in average negative log-likelihood, encouraging the model to learn richer information from them. Across multiple models and benchmarks, AlignDiff consistently outperforms seven other baselines. On AlpacaEval 2.0, training on only 50% of the data selected by AlignDiff nearly doubles the performance of LLaMA-3-8B-SFT compared to training on the full dataset. The data filtered by AlignDiff preserves the length-gap distribution while achieving a more favorable distribution of external reward margins, and difficulty-based curriculum learning further enhances model performance.
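To make the scoring signal concrete, the sketch below shows how a per-sample average negative log-likelihood (NLL) gap between chosen and rejected responses could be computed with Hugging Face transformers. This is a minimal illustration under our own assumptions: the function names (avg_nll, nll_gap) are ours, the boundary handling at the prompt/response split is simplified, and the paper's inverse signal and exact selection criteria are not reproduced here.

import torch
import torch.nn.functional as F

def avg_nll(model, tokenizer, prompt, response, device="cpu"):
    # Average per-token NLL of `response` conditioned on `prompt`.
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + response, return_tensors="pt").input_ids.to(device)
    with torch.no_grad():
        logits = model(full_ids).logits
    # Shift so the token at position t is predicted from positions < t.
    log_probs = F.log_softmax(logits[:, :-1], dim=-1)
    targets = full_ids[:, 1:]
    token_nll = -log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # Average only over the response tokens (simplification: assumes the
    # prompt tokenizes identically inside the concatenation).
    return token_nll[:, prompt_len - 1:].mean().item()

def nll_gap(model, tokenizer, sample):
    # Positive gap: the model already assigns lower NLL to the chosen
    # response, i.e. the pair exhibits a clear preference.
    chosen = avg_nll(model, tokenizer, sample["prompt"], sample["chosen"])
    rejected = avg_nll(model, tokenizer, sample["prompt"], sample["rejected"])
    return rejected - chosen

A plausible use, again an assumption rather than the paper's exact procedure, is to keep pairs with a positive gap and then rank the kept pairs by ascending gap, so that smaller-margin (harder) pairs are prioritized during training.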
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 17729