Abstract: Transfer Entropy (TE) is a principled measure of directed information flow, but its direct estimation in high-dimensional multimodal representation spaces is computationally prohibitive. In this work, we propose a practical distillation framework that replaces direct TE estimation with TE-inspired proxy regularization for multimodal vision-language models. Our method introduces proxy objectives that reward student representations for preserving teacher-aligned predictive structure across modalities, while remaining compatible with standard contrastive distillation losses. We instantiate the framework in CLIP-style teacher--student distillation across multiple teacher backbones, including CLIP RN50, ViT-B/16, and RN50$\times$16, and evaluate it on MSCOCO 2014, Flickr8k, Flickr30k, Food-101, and ImageNet-1k. Across retrieval experiments, the proposed TE-inspired objective consistently improves Image-to-Text performance over MI-based and standard distillation baselines, while remaining competitive on Text-to-Image retrieval. Additional recipe-level diagnostics across temperature and batch size show that these gains are reproducible and are not explained solely by a favorable training recipe. Representation-level analyses further show that TE-inspired distillation yields stronger teacher-student agreement in local neighborhood structure, cosine alignment, and joint image-text embedding geometry. Beyond in-dataset evaluation, cross-dataset retrieval from MSCOCO to Flickr8k shows that the proposed objective better preserves transferable multimodal structure under distribution shift. We also observe improvements in zero-shot classification on Food-101 and out-of-dataset evaluation on ImageNet-1k. Together, these results suggest that TE-inspired proxy regularization provides an effective and scalable mechanism for preserving teacher-consistent cross-modal structure during multimodal distillation.
Submission Type: Long submission (more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=i6gyBJl7sK
Changes Since Last Submission: **1. Method Positioning and Implementation**
Since the last submission, we revised the manuscript to make the connection between the paper’s high-level framing, implemented objective, and empirical validation more explicit. In particular, we now state throughout that the optimization-step transfer entropy in Eq. (3) serves as a conceptual motivation rather than a quantity directly estimated or optimized in practice. The manuscript also makes the methodological chain clearer: Eq. (3) motivates a perturbation-defined one-step construction, which leads to the Jacobian-cosine relation in Theorem 1 and is ultimately instantiated through the within-batch finite-difference proxy objectives TE1/TE2 in Algorithm 1. Thus, the practical training signal is the final proxy objective, not a direct realization of Eq. (3). We revised the Abstract, Introduction, Method, and summary statements accordingly so that the paper-level claims align with what is actually optimized and empirically evaluated.
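For concreteness, the following is a minimal sketch of a within-batch finite-difference cosine proxy of the kind described above; the names and exact functional form are illustrative and are not a verbatim copy of Algorithm 1:

```python
import torch
import torch.nn.functional as F

def te_proxy_loss(student_emb, teacher_emb):
    """Illustrative within-batch finite-difference cosine proxy.

    student_emb, teacher_emb: (B, D) embeddings for the same batch.
    Pairwise differences between batch elements act as finite-difference
    probes of the representation map; the loss rewards the student for
    matching the *direction* of the teacher's differences (a cosine /
    Jacobian-alignment surrogate), not their scale.
    """
    # (B, B, D) pairwise finite differences within the batch
    ds = student_emb.unsqueeze(1) - student_emb.unsqueeze(0)
    dt = teacher_emb.unsqueeze(1) - teacher_emb.unsqueeze(0)
    # cosine similarity between corresponding difference vectors
    cos = F.cosine_similarity(ds, dt, dim=-1, eps=1e-8)
    # exclude the zero differences on the diagonal (i == j)
    mask = ~torch.eye(cos.size(0), dtype=torch.bool, device=cos.device)
    return 1.0 - cos[mask].mean()  # 0 when all directions agree
```

The key point the sketch conveys is that such a proxy compares directions of representation differences across batch elements, which is what makes it a cheap within-batch surrogate for Jacobian alignment.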
We also made a small revision to Algorithm 1 so that it matches the implementation used in our experiments. Specifically, we updated the final parameter-update step to reflect the actual optimizer used in practice, namely Adam, rather than a generic gradient-descent update. We confirm that all reported experimental results in the manuscript were obtained using the implementation reflected in the current Algorithm 1. The corresponding code is provided in the supplementary material to support reproducibility and to eliminate ambiguity about the training procedure. We have added this explicit clarification to the last paragraph of Section 4.4.
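To make the revised update step concrete, here is a toy sketch of the final step of Algorithm 1 using Adam rather than a generic gradient-descent update; the modules, loss, and learning rate are stand-ins, not our actual training configuration:

```python
import torch

# Toy stand-ins; the real student/teacher are CLIP-style encoders.
student = torch.nn.Linear(512, 256)
teacher = torch.nn.Linear(512, 256).requires_grad_(False)  # frozen teacher

# Revised final step of Algorithm 1: an Adam update rather than a
# generic gradient-descent step (hyperparameters here are illustrative).
optimizer = torch.optim.Adam(student.parameters(), lr=1e-4)

x = torch.randn(32, 512)                                   # dummy batch
loss = torch.nn.functional.mse_loss(student(x), teacher(x))
optimizer.zero_grad()
loss.backward()
optimizer.step()                                           # Adam parameter update
```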
Beyond these changes, we revised terminology and exposition throughout the paper for consistency. In particular, we now refer to TE1/TE2 as TE-inspired proxies or proxy objectives rather than implying that they are direct TE estimators. We also revised the appendices to match this framing: Appendix D is now presented as TE-Inspired Proxies versus an Exact Gaussian TE Reference, and the computational-cost analysis in Appendix E now describes the actual within-batch finite-difference cosine proxy computation rather than the older TE-estimation language.
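For reference, the exact Gaussian TE quantity used as the comparison point in Appendix D has the standard closed form (our notation here, assuming scalar variables and unit history length; the appendix states the precise parameterization):

$$
T_{X \to Y} \;=\; I\!\left(Y_{t+1};\, X_t \mid Y_t\right)
\;=\; \tfrac{1}{2}\,\log \frac{\operatorname{Var}\!\left(Y_{t+1} \mid Y_t\right)}{\operatorname{Var}\!\left(Y_{t+1} \mid Y_t,\, X_t\right)}
$$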
**2. Experimental Design and Evidence**
On the empirical side, we substantially expanded the experiments so that the evidence better matches the clarified proxy-level claims.
First, we added Section 5.1.3, Training Dynamics and Recipe-Level Diagnostics, to address concerns about KL/MSE behavior and recipe sensitivity. This new section examines training trajectories under multiple temperatures and batch sizes, rather than relying only on a fixed recipe. These diagnostics show that lowering the temperature yields clearer KL reduction, while changing the batch size does not substantially alter the qualitative KL/MSE behavior. At the same time, the gains from the TE-inspired proxies, especially for image-to-text retrieval, remain reproducible across these settings. This helps show that the improvements are not merely artifacts of one particular choice of temperature or negative-pool size.
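As a reading aid, the temperature-scaled KL term behaves as in the following sketch (illustrative, not the paper's exact loss): a lower temperature sharpens the teacher's target distribution, which is consistent with the clearer KL reduction observed at low $\tau$.

```python
import torch
import torch.nn.functional as F

def kd_kl_loss(student_logits, teacher_logits, tau):
    """Temperature-scaled KL distillation term (illustrative)."""
    log_p_student = F.log_softmax(student_logits / tau, dim=-1)
    p_teacher = F.softmax(teacher_logits / tau, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean")

# Recipe-level grid of the kind swept in Section 5.1.3 (values illustrative):
for tau in (0.03, 0.07):
    for batch_size in (64, 128, 256):
        ...  # train and log KL/MSE trajectories under each recipe
```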
Second, we added a new cross-dataset evaluation in Section 5.1.4. Specifically, we distill an RN34 student from an RN50 teacher on MSCOCO and evaluate zero-shot retrieval on Flickr8k without additional fine-tuning. This experiment was added to test whether the implemented proxy objective preserves transferable multimodal structure beyond the source training distribution. The new results show that the TE-inspired proxy performs better than the MI-based baseline in this cross-dataset setting, strengthening the paper’s generalization claim at the level of the implemented regularizer.
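The zero-shot cross-dataset evaluation follows the standard retrieval protocol; a minimal sketch of image-to-text Recall@K over paired embeddings is shown below (illustrative, not our exact evaluation code):

```python
import torch

def recall_at_k(image_emb, text_emb, k=1):
    """Image-to-text Recall@K for paired embeddings, where row i of each
    matrix corresponds to the same image-caption pair (illustrative)."""
    image_emb = torch.nn.functional.normalize(image_emb, dim=-1)
    text_emb = torch.nn.functional.normalize(text_emb, dim=-1)
    sims = image_emb @ text_emb.t()              # (N, N) cosine similarities
    topk = sims.topk(k, dim=-1).indices          # top-k retrieved texts per image
    targets = torch.arange(sims.size(0)).unsqueeze(1)
    return (topk == targets).any(dim=-1).float().mean().item()
```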
Third, we added Section 5.2, Representation-Level Diagnostics, to provide a more direct mechanism analysis. This section examines how the proxy regularizer changes the learned embedding geometry under both settings: Section 5.2.1 at $\tau=0.07$ (previously Appendix H.5) and Section 5.2.2 at $\tau=0.03$ (newly added in this revision). The new analyses include teacher-only PCA projections, joint image-text visualizations, kNN neighborhood agreement, and teacher-student cosine alignment metrics. Together, these results show that the TE-inspired regularizer yields more teacher-faithful local and global geometry than the MI-based baseline, and they further connect these structural differences to the observed retrieval gains. This strengthens the paper’s mechanism discussion by grounding it in both qualitative and quantitative diagnostics.
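The two quantitative diagnostics are simple to state; the following sketch illustrates versions of them (names and details are ours, not the paper's exact implementation):

```python
import torch

def knn_agreement(student_emb, teacher_emb, k=10):
    """Mean fraction of shared k-nearest neighbors between the teacher
    and student embedding spaces (illustrative neighborhood diagnostic)."""
    def knn(emb):
        emb = torch.nn.functional.normalize(emb, dim=-1)
        sims = emb @ emb.t()
        sims.fill_diagonal_(float("-inf"))       # exclude self-matches
        return sims.topk(k, dim=-1).indices
    ns, nt = knn(student_emb), knn(teacher_emb)
    overlap = [len(set(a.tolist()) & set(b.tolist())) / k for a, b in zip(ns, nt)]
    return sum(overlap) / len(overlap)

def cosine_alignment(student_emb, teacher_emb):
    """Mean per-sample teacher-student cosine similarity (assumes the two
    spaces share a dimension, e.g. after the student's projection head)."""
    return torch.nn.functional.cosine_similarity(
        student_emb, teacher_emb, dim=-1).mean().item()
```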
Fourth, we moved the classification experiments (previously Appendices H.3 and H.4) to the main paper as Section 5.6. This change was made to better highlight the effectiveness and generality of the proposed TE-inspired approach in the main empirical presentation.
Assigned Action Editor: ~Zhe_Gan1
Submission Number: 7940