Keywords: cross-architecture, knowledge distillation, feature alignment
TL;DR: A novel feature-based knowledge distillation method for transferring knowledge from Transformers to CNNs.
Abstract: Transformer architectures have demonstrated remarkable success in capturing long-range dependencies and global contextual information, whereas Convolutional Neural Networks (CNNs) remain dominant in many industrial applications due to their efficiency and strong local feature modeling. Bridging the complementary strengths of these architectures, Cross-Architecture Knowledge Distillation (CAKD) has emerged as a promising approach for transferring global knowledge from Transformers to CNNs. However, existing methods either rely on generic distillation strategies that fail to address inductive bias discrepancies, or reduce informative features to logits, which limits generalization across tasks. To overcome these issues, we propose a novel feature-based framework that aligns representations from both structural and semantic perspectives. Structurally, we refine a global information supplement module to extract residual cues through global-local comparison, facilitating more compatible feature transfer. Semantically, we apply $\ell_{1}$-regularization to encourage sparse and meaningful global compensation patterns, mimicking Transformers' attention outputs. Extensive experiments on image classification and instance segmentation benchmarks demonstrate that our method effectively mitigates the feature misalignment between Transformers and CNNs, yielding consistent improvements over state-of-the-art methods, with gains of up to 2.7\% on CIFAR-100 and 0.9\% on ImageNet-1K.
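The abstract's two-term objective (structural feature alignment plus an $\ell_{1}$-regularized global compensation) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `cakd_feature_loss`, the weight `lam`, and the use of spatial mean pooling as the "global" context are all assumptions for exposition.

```python
import numpy as np

def cakd_feature_loss(student_feat, teacher_feat, lam=0.1):
    """Hypothetical sketch of a feature-based CAKD objective.

    student_feat, teacher_feat: arrays of shape (C, H, W),
    assumed already projected to a common channel dimension.
    """
    # Global context: spatial average of the student features
    # (a stand-in for the global information supplement module).
    global_ctx = student_feat.mean(axis=(1, 2), keepdims=True)

    # Residual cues from global-local comparison.
    residual = student_feat - global_ctx

    # Semantic term: l1-regularization encourages a sparse
    # global compensation pattern.
    l1_term = np.abs(residual).mean()

    # Structural term: align student features to the teacher's.
    align_term = ((student_feat - teacher_feat) ** 2).mean()

    return align_term + lam * l1_term
```

In an actual training loop, `align_term` would be computed on projected intermediate feature maps and the total loss combined with the task loss; the sketch above only shows the shape of the two penalties.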
Primary Area: learning theory
Submission Number: 11462