Abstract: Multimedia content is predominant in the modern Web era. Many recommender models have been proposed to investigate how to effectively incorporate multimodal content information into the traditional collaborative filtering framework. Using multiple modalities is expected to provide more comprehensive information and thus superior performance. However, integrating multiple modalities often runs into the modal imbalance problem: because the amount of information differs across modalities, optimizing the same objective for all of them leaves the weaker modalities under-optimized, with slower convergence or lower performance. Even worse, we find that in multimodal recommendation models all modalities suffer from insufficient optimization.
To address these issues, we propose a Counterfactual Knowledge Distillation (CKD) method that mitigates the imbalance problem and makes full use of all modalities. Through modality-specific knowledge distillation, CKD guides the multimodal model to learn modality-specific knowledge from uni-modal teachers. We also design a novel generic-and-specific distillation loss that guides the multimodal student to learn broader and deeper knowledge from its teachers. Additionally, to adaptively recalibrate the model's focus towards weaker modalities during training, we estimate the causal effect of each modality on the training objective via counterfactual inference, which allows us to identify the weak modalities, quantify the degree of imbalance, and re-weight the distillation loss accordingly.
Our method can serve as a plug-and-play module for both late-fusion and early-fusion backbones. Extensive experiments on six backbones show that it improves performance by a large margin.
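For intuition, here is a minimal PyTorch-style sketch of the counterfactual re-weighting idea. The function names, the dict-based batch layout, the uninformative reference input, and the specific weighting scheme are illustrative assumptions for exposition, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def causal_effect(model, batch, modality, reference):
    """Estimate the effect of one modality on the model's output by comparing
    the factual prediction with a counterfactual one in which that modality's
    input is replaced by an uninformative reference (illustrative estimator;
    the paper's exact counterfactual construction may differ)."""
    factual = model(batch)                     # prediction with all modalities
    cf_batch = {**batch, modality: reference}  # counterfactually mute one modality
    counterfactual = model(cf_batch)
    return (factual - counterfactual).abs().mean()

def reweighted_distillation_loss(student_logits, teacher_logits, effects, tau=2.0):
    """Re-weight per-modality distillation losses so that weaker modalities
    (smaller estimated causal effect) receive proportionally more attention."""
    total = sum(effects.values())
    loss = 0.0
    for m, t_logits in teacher_logits.items():
        weight = 1.0 - effects[m] / total  # weaker modality -> larger weight
        kd = F.kl_div(
            F.log_softmax(student_logits / tau, dim=-1),
            F.softmax(t_logits / tau, dim=-1),
            reduction="batchmean",
        ) * tau ** 2
        loss = loss + weight * kd
    return loss
```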
Primary Subject Area: [Engagement] Multimedia Search and Recommendation
Secondary Subject Area: [Content] Multimodal Fusion
Relevance To Conference: The paper addresses the challenge of effectively integrating multimodal content information into collaborative filtering frameworks for multimedia recommendation. Existing models often suffer from modal imbalance, where different modalities converge at different rates, leaving the slower-converging ones under-optimized. The proposed Counterfactual Knowledge Distillation (CKD) method addresses this by guiding the multimodal model to learn from uni-modal teachers, using modality-specific knowledge distillation and a novel generic-and-specific distillation loss. Additionally, counterfactual inference is employed to estimate the causal effect of each modality on the training objective, allowing the model's focus to be adaptively recalibrated towards weaker modalities. The method is compatible with both late-fusion and early-fusion backbones and significantly improves performance across various datasets and state-of-the-art recommendation models.
Supplementary Material: zip
Submission Number: 1280