Tiny-vGamba: Distilling Large Vision-(Language) Knowledge from CLIP into a Lightweight vGamba Network
Abstract: Deploying high-performing visual models on resource-constrained edge devices remains a challenge due to the computational demands of architectures like Vision Transformers (ViT) and large vision-language models such as CLIP (Contrastive Language-Image Pre-training). In this work, we propose Tiny-vGamba, a lightweight visual recognition backbone designed for deployment in low-latency settings while still modeling long-range dependencies efficiently. Building upon the original vGamba architecture, which combines Mamba-based state-space modeling with attention mechanisms, we simplify the design by removing memory-intensive components such as global self-attention and fusion gates, retaining a streamlined Mamba Gamba-Cell for efficient long-range dependency modeling. To further enhance generalization, we introduce a knowledge distillation framework that transfers visual-semantic knowledge from a frozen CLIP (ViT-B/32) teacher to the Tiny-vGamba student. This includes logit and feature-level supervision, mean-variance alignment, and lightweight channel attention to guide the student's learning. Experimental results on classification, detection, and segmentation tasks show that Tiny-vGamba outperforms several lightweight baselines, making it well-suited for real-time edge applications.
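To make the distillation objective concrete, the sketch below shows one plausible PyTorch formulation of the losses named in the abstract: logit-level KD from the frozen CLIP teacher, feature-level supervision through a learned projection, and mean-variance alignment of channel statistics. This is an illustrative assumption, not the authors' released code; the class name, temperature, projection layer, and loss weights are all hypothetical placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyVGambaDistillLoss(nn.Module):
    """Illustrative combination of task, logit-KD, feature, and mean-variance losses."""

    def __init__(self, student_dim, teacher_dim=512, temperature=4.0,
                 w_logit=1.0, w_feat=0.5, w_stat=0.1):
        super().__init__()
        # Project student features to the CLIP ViT-B/32 embedding width (assumed 512).
        self.proj = nn.Linear(student_dim, teacher_dim)
        self.t = temperature
        self.w_logit, self.w_feat, self.w_stat = w_logit, w_feat, w_stat

    def forward(self, s_logits, t_logits, s_feat, t_feat, targets):
        # Supervised cross-entropy on the student's own predictions.
        ce = F.cross_entropy(s_logits, targets)

        # Logit-level KD: KL divergence between temperature-softened distributions.
        kd = F.kl_div(
            F.log_softmax(s_logits / self.t, dim=-1),
            F.softmax(t_logits / self.t, dim=-1),
            reduction="batchmean",
        ) * (self.t ** 2)

        # Feature-level supervision: align projected student features with the
        # frozen CLIP image embeddings.
        s_proj = self.proj(s_feat)
        feat = F.mse_loss(s_proj, t_feat)

        # Mean-variance alignment: match first- and second-order channel statistics.
        stat = F.mse_loss(s_proj.mean(dim=0), t_feat.mean(dim=0)) + \
               F.mse_loss(s_proj.var(dim=0), t_feat.var(dim=0))

        return ce + self.w_logit * kd + self.w_feat * feat + self.w_stat * stat
```

A usage note under the same assumptions: during training the teacher runs in `torch.no_grad()` mode, and only the student backbone and the projection layer receive gradients.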