Beyond Hard Supervised Fine-tuning: Enhancing Image-text Alignment of Strong Models with Weak Models

20 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Image-text alignment, weak-to-strong supervision
Abstract: Image–text alignment models, such as CLIP, are typically trained with large-scale contrastive learning: paired data are treated as positives, while all unpaired combinations are treated as negatives. However, this hard supervision overlooks the fact that some unpaired combinations are semantically related rather than irrelevant, and penalising them as strict negatives introduces noise that limits model performance. We propose Permute-then-Adapt (PTA), a weak-to-strong supervision framework that addresses this issue. PTA comprises two key innovations: (1) permutation-based thresholding, which identifies and filters unreliable negatives by estimating a null distribution of similarities, and (2) a soft supervision strategy that leverages above-threshold similarities to provide an additional training signal. Across benchmarks, PTA consistently improves the alignment ability of strong models on object recognition and cross-modal retrieval.
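The abstract only sketches the two components at a high level, so the following is a hypothetical NumPy illustration of the general idea, not the authors' implementation: a null distribution of similarities is estimated by permuting the text indices of a batch similarity matrix (breaking true image–text correspondence), a quantile of that distribution gives a threshold, and unpaired similarities above the threshold become soft targets instead of hard negatives. All function names, the quantile choice, and the row-normalised soft-label scheme are assumptions for illustration.

```python
import numpy as np

def permutation_threshold(sim, n_perm=200, quantile=0.95, seed=0):
    """Estimate a similarity threshold from a permutation null distribution.

    sim: (n, n) image-text similarity matrix, row i / column i is a true pair.
    Returns the `quantile` of similarities collected over random permutations
    of the text indices (true pairs excluded), i.e. a null for "unrelated".
    """
    rng = np.random.default_rng(seed)
    n = sim.shape[0]
    null_sims = []
    for _ in range(n_perm):
        perm = rng.permutation(n)
        keep = perm != np.arange(n)  # drop positions that land on the true pair
        null_sims.append(sim[np.arange(n)[keep], perm[keep]])
    return np.quantile(np.concatenate(null_sims), quantile)

def soft_targets(sim, tau):
    """Build soft contrastive targets: unpaired entries above `tau` keep a
    similarity-proportional weight rather than being forced to zero.
    (Hypothetical soft-supervision scheme; rows are normalised to sum to 1.)
    """
    n = sim.shape[0]
    targets = np.eye(n)
    off_diag = ~np.eye(n, dtype=bool)
    above = (sim > tau) & off_diag
    targets[above] = sim[above]
    return targets / targets.sum(axis=1, keepdims=True)
```

In a contrastive loss, `soft_targets(sim, tau)` would replace the one-hot identity labels, so that semantically related but unpaired items are no longer penalised as strict negatives.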
Primary Area: transfer learning, meta learning, and lifelong learning
Submission Number: 23917