Abstract: Although LLMs have achieved significant success, their reliance on large volumes of human-annotated data limits their potential for further scaling. In this setting, utilizing self-generated synthetic data has become crucial for fine-tuning LLMs without extensive human annotation. However, current methods often fail to ensure consistent improvements across iterations, with performance stagnating after only minimal updates. To overcome these challenges, we introduce Dynamic Noise Preference Optimization (DNPO), which combines dynamic sample labeling for constructing preference pairs with controlled, trainable noise injection during preference optimization. Our approach effectively prevents stagnation and enables continuous improvement. In experiments with Llama-3.2-3B and Zephyr-7B, DNPO consistently outperforms existing methods across multiple benchmarks. Additionally, with Zephyr-7B, DNPO shows a significant improvement in model-generated data quality, achieving a 29.4% win-loss rate gap over the baseline in GPT-4 evaluations.
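To make the core idea concrete, below is a minimal sketch of a DPO-style objective with a trainable noise term injected into the preference margin. This is an illustrative assumption, not the paper's exact formulation: the function name `dnpo_style_loss`, the placement of the noise on the log-ratio margin, and the single learnable scale `noise_std` are all hypothetical choices; DNPO may instead inject noise elsewhere (e.g., into model parameters or representations).

```python
import torch
import torch.nn.functional as F

def dnpo_style_loss(policy_chosen_logps, policy_rejected_logps,
                    ref_chosen_logps, ref_rejected_logps,
                    noise_std, beta=0.1):
    """Hypothetical DPO-style loss with trainable Gaussian noise added to the
    preference margin; the actual DNPO formulation may differ."""
    # Standard DPO log-ratio margin between chosen and rejected responses.
    margin = (policy_chosen_logps - ref_chosen_logps) \
           - (policy_rejected_logps - ref_rejected_logps)
    # Zero-mean Gaussian noise whose scale is learned jointly with the policy;
    # the reparameterization (scale * standard normal) keeps it differentiable.
    noise = noise_std * torch.randn_like(margin)
    return -F.logsigmoid(beta * (margin + noise)).mean()

# Usage sketch with dummy log-probabilities for a batch of 4 preference pairs.
noise_std = torch.nn.Parameter(torch.tensor(0.1))
loss = dnpo_style_loss(torch.randn(4), torch.randn(4),
                       torch.randn(4), torch.randn(4), noise_std)
loss.backward()  # gradients flow to the learnable noise scale
```

The design intuition, under this assumption, is that a learnable noise scale perturbs the preference signal enough to keep optimization from collapsing onto near-identical chosen/rejected pairs in later self-improvement iterations, while remaining trainable so the perturbation can shrink as the margin becomes informative.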
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Kangwook_Lee1
Submission Number: 5730