Learning Dynamics of VLM Finetuning: Cooling-Weighted DPO with Mixed Negatives

02 Sept 2025 (modified: 12 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Vision–Language Models, Preference Optimization, Reinforcement Learning from Human Feedback, Stable Fine-tuning
TL;DR: We address instability in VLM preference finetuning by introducing CW-DPO, a two-stage method that smooths training with constrained SFT and adaptively down-weights easy negatives, yielding stable, calibrated, and stronger performance.
Abstract: Preference-based finetuning of vision–language models (VLMs) is brittle: trivially wrong negatives inject uninformative gradients that destabilize training. We recast alignment as **learning-dynamics-aware optimization** and introduce Cooling-Weighted DPO (CW-DPO), a two-stage recipe that explicitly models and exploits the training trajectory. Stage 1 performs supervised finetuning with **gentle negatives**, i.e., **low-weight smoothed supervision** that regularizes the base policy and curbs overconfidence without explicit penalties. Stage 2 applies a DPO objective in which the **negative term is scaled by a cooling weight** computed from the model's **average token log-probability** on each negative, suppressing uninformative gradients from easy or off-distribution samples while preserving signal from hard negatives. In practice, we emphasize **on-policy negatives** and allow **mixed negatives** by blending a controllable fraction of dataset negatives to maintain contrast freshness. Throughout, we instrument training with $\Delta\!\log p$ probes on positives and negatives as first-class signals for early stopping, curriculum design, and failure diagnosis. Across diverse VLM tasks, CW-DPO yields more stable optimization, better calibration, and higher pairwise win-rates than SFT-only and vanilla DPO, while converging in fewer steps. Ablations isolate our cooling-weight mechanism as the primary driver of these gains and show complementary benefits from mixing on-policy and dataset negatives. Taken together, our results show that smoothing learning dynamics before cooling preferences is a simple, general principle for robust VLM alignment.
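To make the Stage-2 objective concrete, here is a minimal sketch in standard DPO notation; the specific cooling-weight form $w$ (a sigmoid with illustrative temperature $\tau$ and threshold $c$) is an assumption consistent with the description above, not necessarily the paper's exact parameterization:

$$
\bar{\ell}_\theta(y^-\mid x) \;=\; \frac{1}{|y^-|}\sum_{t=1}^{|y^-|}\log \pi_\theta\!\big(y^-_t \mid x, y^-_{<t}\big),
\qquad
w(y^-) \;=\; \sigma\!\big(\tau\,[\bar{\ell}_\theta(y^-\mid x) - c]\big),
$$
$$
\mathcal{L}_{\text{CW-DPO}}
\;=\;
-\log \sigma\!\Big(
\beta \log\tfrac{\pi_\theta(y^+\mid x)}{\pi_{\text{ref}}(y^+\mid x)}
\;-\;
\beta\, w(y^-)\,\log\tfrac{\pi_\theta(y^-\mid x)}{\pi_{\text{ref}}(y^-\mid x)}
\Big).
$$

Under any such monotone choice of $w$, a trivially wrong negative to which the model already assigns very low average token log-probability receives $w \approx 0$ and contributes little gradient, while a hard negative with relatively high $\bar{\ell}_\theta$ retains close to the full DPO contrast.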
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 819