Investigating the Impact of Data Distribution Shifts on Cross-Modal Knowledge Distillation

20 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: representation learning for computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Cross-Modal Knowledge Distillation; Data Distribution Shifts
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Abstract: Cross-modal knowledge distillation (KD) extends the traditional KD approach to multimodal learning and has achieved notable success in various applications. However, when there is a considerable shift in data distribution during cross-modal KD, even a more accurate teacher model may fail to effectively instruct the student model. In this paper, we conduct a comprehensive analysis and evaluation of the effectiveness of cross-modal KD, focusing on its dependence on distribution shifts in multimodal data. We first view cross-modal KD as training a maximum entropy model with pseudo-labels and establish conditions under which it outperforms unimodal KD. We then introduce the hypothesis of solution space divergence, which reveals the crucial factor influencing the efficacy of cross-modal KD. Our key observation is that the accuracy of the teacher model is not the primary determinant of the student model's accuracy; instead, the data distribution shift plays a more significant role. We demonstrate that as the data distribution shift decreases, the effectiveness of cross-modal KD improves, and vice versa. Finally, to address significant data distribution differences, we propose a method called the ``perceptual solution space mask'' to enhance the effectiveness of cross-modal KD. Through experimental results on four multimodal datasets, we validate our assumptions and provide directions for future improvements in cross-modal knowledge transfer. Notably, our enhanced KD method achieves an approximately 2\% improvement in \emph{mIoU} over the baseline on the SemanticKITTI dataset.
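
To make the abstract's framing concrete, below is a minimal sketch of a cross-modal KD objective in PyTorch, written from the description above rather than from the paper's released code. The function name, the temperature/alpha parameters, and the per-sample mask argument are all illustrative assumptions; the mask is shown only as a hypothetical stand-in for the proposed ``perceptual solution space mask'', which in the paper would be derived from the divergence between the two modalities' solution spaces.

import torch
import torch.nn.functional as F

def cross_modal_kd_loss(student_logits, teacher_logits, labels,
                        temperature=4.0, alpha=0.5, mask=None):
    # teacher_logits come from a model trained on one modality (e.g. camera),
    # student_logits from a model on another modality (e.g. LiDAR).
    # Soft teacher predictions act as pseudo-labels for the student.
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    kd = F.kl_div(log_probs, soft_targets, reduction="none").sum(dim=-1)
    if mask is not None:
        # Hypothetical per-sample weight: down-weight samples where the
        # modalities' solution spaces are assumed to diverge strongly.
        kd = kd * mask
    kd = kd.mean() * (temperature ** 2)
    ce = F.cross_entropy(student_logits, labels)  # supervised hard-label term
    return alpha * kd + (1.0 - alpha) * ce

In this reading, the distillation term transfers the teacher's pseudo-label distribution across modalities, while the optional weighting suppresses the contribution of regions affected by the data distribution shift; how that weighting is actually computed is the subject of the proposed method, not shown here.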
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
Supplementary Material: zip
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 2487