Keywords: Cross-modal Knowledge Distillation, Multimodal Learning, Mutual Information, Representation Learning, Modality Selection, Modality Gap
TL;DR: This paper proposes that cross-modal knowledge distillation is successful when the mutual information between the teacher and student representations exceeds that between the student representation and the labels.
Abstract: The rapid increase in multimodal data availability has sparked significant interest in cross-modal knowledge distillation (KD) techniques, where richer "teacher" modalities transfer information to weaker "student" modalities during model training to improve performance. However, despite successes across various applications, cross-modal KD does not always result in improved outcomes, primarily due to a limited theoretical understanding that could inform practice. To address this gap, we introduce the Cross-modal Complementarity Hypothesis (CCH): we propose that cross-modal KD is effective when the mutual information between teacher and student representations exceeds the mutual information between the student representation and the labels. We theoretically validate the CCH in a joint Gaussian model and further confirm it empirically across diverse multimodal datasets, including image, text, video, audio, and cancer-related omics data. Our study establishes a novel theoretical framework for understanding cross-modal KD and offers practical guidelines based on the CCH criterion to select optimal teacher modalities for improving the performance of weaker modalities.
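Stated formally, as a minimal sketch under assumed notation (the page does not define symbols: here $f_t(x_t)$ denotes the teacher-modality representation, $f_s(x_s)$ the student-modality representation, $Y$ the labels, and $I(\cdot;\cdot)$ mutual information), the CCH criterion described in the abstract can be written as:

```latex
% Cross-modal Complementarity Hypothesis (CCH) -- notation assumed, not taken from the paper:
% f_t(x_t): teacher representation, f_s(x_s): student representation, Y: labels, I(.;.): mutual information.
\[
  I\bigl(f_t(x_t);\, f_s(x_s)\bigr) \;>\; I\bigl(f_s(x_s);\, Y\bigr)
  \quad\Longrightarrow\quad
  \text{cross-modal KD is expected to improve the student modality.}
\]
```

Under this reading, the criterion also suggests a modality-selection rule: among candidate teachers, prefer the one whose representation shares the most mutual information with the student representation, provided that amount exceeds the information the student already carries about the labels.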
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 17502