AME: ALIGNED MANIFOLD ENTROPY FOR ROBUST UNSUPERVISED VISION-LANGUAGE DISTILLATION

11 Sept 2025 (modified: 25 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Vision Language Model, Unsupervised, Knowledge Distillation, Information Entropy
Abstract: Unsupervised knowledge distillation is a long-established technique for knowledge transfer, and it has regained attention with the emergence of pre-trained vision-language models (VLMs). However, its objectives, i.e., probability distributions, are inherently scalar and directionless, in sharp contrast to the similarity-based objectives employed in vision-language model training. As a result, the unsupervised distillation paradigm fails to impose sufficient cross-modal alignment, which is essential for generalization in vision-language knowledge distillation. To address this challenge arising from representation misalignment, we propose Aligned Manifold Entropy (AME) for robust unsupervised vision-language distillation, aiming to achieve robust generalization in vision-language distillation tasks. Specifically, AME performs entropy compression over a restructured shared manifold (RSM), in which multi-modal inputs (images and texts) are jointly embedded through projection functions. The embedded features are compressed into a compact structure, which in turn enforces directional alignment within the representation space. Notably, AME keeps the original backbone architecture and requires no additional modules. The proposed AME thus establishes a paradigm that effectively reinstates directional alignment and significantly improves representation convergence in low-data regimes. Extensive experiments and theoretical analysis across a wide range of settings demonstrate that AME consistently enables robust unsupervised knowledge distillation, resulting in superior generalization across 11 datasets. AME is therefore a principled paradigm for unsupervised vision-language distillation, extending it to a broader range of downstream tasks.
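
To make the described mechanism concrete, below is a minimal sketch of one way an entropy-compression objective over a shared image-text embedding space could look. This is not the paper's exact AME formulation, which the abstract does not specify; the function and names (ame_entropy_loss, proj_img, proj_txt, temperature) are illustrative assumptions, and the projection heads stand in for the VLM's existing projection functions (no added modules).

    # Hypothetical sketch: entropy compression over a joint image-text manifold.
    # Not the authors' implementation; names and the exact loss are assumptions.
    import torch
    import torch.nn.functional as F

    def ame_entropy_loss(img_feats: torch.Tensor,
                         txt_feats: torch.Tensor,
                         proj_img: torch.nn.Module,
                         proj_txt: torch.nn.Module,
                         temperature: float = 0.07) -> torch.Tensor:
        """Minimize the entropy of pairwise similarities on a shared manifold.

        img_feats: (N, d_img) backbone image features
        txt_feats: (M, d_txt) backbone text features
        proj_img / proj_txt: projection functions mapping both modalities
            into a common embedding space.
        """
        # Embed both modalities into the shared space and L2-normalize.
        z_img = F.normalize(proj_img(img_feats), dim=-1)   # (N, d)
        z_txt = F.normalize(proj_txt(txt_feats), dim=-1)   # (M, d)

        # Joint manifold: stack image and text embeddings together.
        z = torch.cat([z_img, z_txt], dim=0)               # (N + M, d)

        # Row-wise similarity distribution over the joint set, self-pairs masked.
        sim = z @ z.t() / temperature                       # (N+M, N+M)
        mask = torch.eye(z.size(0), dtype=torch.bool, device=z.device)
        p = sim.masked_fill(mask, float('-inf')).softmax(dim=-1)

        # Shannon entropy of each row; minimizing it sharpens the distribution,
        # pulling each embedding toward a few cross-modal neighbors and thereby
        # encouraging directional alignment in the shared space.
        entropy = -(p * torch.log(p.clamp_min(1e-12))).sum(dim=-1)
        return entropy.mean()

Under these assumptions, the loss would be added to the usual unsupervised distillation objective; because it only reuses the model's projection heads, the backbone architecture is left unchanged.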
Supplementary Material: pdf
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 3958