Probabilistic Knowledge Transfer for Lightweight Vision Transformers

Published: 01 Jan 2024, Last Modified: 28 Jan 2025 · EUSIPCO 2024 · CC BY-SA 4.0
Abstract: The advent of Vision Transformers has reshaped the landscape of computer vision and robotics, harnessing the ability of the Transformer architecture to capture intricate long-range dependencies within data. Nevertheless, the computational demands of Transformers remain substantial, motivating methodologies for model compression. Knowledge distillation has emerged as a viable solution, transferring insights from larger deep learning models to more compact ones while maintaining accuracy. Among these methods, Probabilistic Knowledge Transfer (PKT) has proven particularly effective: instead of relying solely on the teacher's final predictions, it models the conditional probabilities of data representations extracted from various layers, capturing the underlying geometry of the representation space. However, applying PKT to Transformer models is not straightforward and raises significant challenges (e.g., density estimation in high-dimensional spaces). The main contribution of this work is a patch-based knowledge distillation approach that builds upon the powerful modeling capabilities of PKT and targets Vision Transformer architectures. To this end, we employ a lightweight Vision Transformer architecture suited for deployment on robotics and embedded systems. We demonstrate the effectiveness of the proposed approach through comprehensive experiments on the CIFAR-10 and Tiny-ImageNet datasets.
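For orientation, the sketch below shows a minimal PyTorch implementation of the standard PKT objective that the abstract builds on: pairwise cosine similarities within a batch are turned into conditional probability distributions for teacher and student features, and the two distributions are matched with a KL divergence. The function name `pkt_loss`, the `eps` smoothing term, and the suggested application to patch-token embeddings are illustrative assumptions; this is not the paper's exact patch-based formulation.

```python
import torch
import torch.nn.functional as F

def pkt_loss(student_feats: torch.Tensor,
             teacher_feats: torch.Tensor,
             eps: float = 1e-7) -> torch.Tensor:
    """Probabilistic Knowledge Transfer loss (sketch, assumptions noted above).

    student_feats: (N, d_s) student representations for a batch.
    teacher_feats: (N, d_t) teacher representations for the same batch.
    Feature dimensions d_s and d_t may differ, since only pairwise
    similarities within each space are compared.
    """
    # L2-normalize so that inner products equal cosine similarities.
    s = F.normalize(student_feats, p=2, dim=1)
    t = F.normalize(teacher_feats, p=2, dim=1)

    # Pairwise cosine similarities, shifted from [-1, 1] into [0, 1].
    s_sim = (s @ s.t() + 1.0) / 2.0
    t_sim = (t @ t.t() + 1.0) / 2.0

    # Row-normalize into conditional probabilities p(j | i).
    s_cond = s_sim / s_sim.sum(dim=1, keepdim=True)
    t_cond = t_sim / t_sim.sum(dim=1, keepdim=True)

    # KL divergence from student conditionals to teacher conditionals.
    return (t_cond * torch.log((t_cond + eps) / (s_cond + eps))).sum(dim=1).mean()


# Hypothetical usage for a ViT (an assumption, not the paper's pipeline):
# flatten patch-token embeddings of shape (B, P, d) into (B * P, d)
# and apply the same loss over the pooled set of patch representations.
student_tokens = torch.randn(8, 16, 192)   # (batch, patches, student dim)
teacher_tokens = torch.randn(8, 16, 768)   # (batch, patches, teacher dim)
loss = pkt_loss(student_tokens.flatten(0, 1), teacher_tokens.flatten(0, 1))
```

One design point worth noting: because the loss compares similarity structures rather than raw features, no projection layer is needed to reconcile the teacher's and student's embedding dimensions.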