MatchViT: Lightweight Vision Transformer with Matching Separable Self-Attention

ICLR 2026 Conference Submission 16361 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Vision Transformer, Lightweight model, Computer vision
Abstract: Vision Transformers (ViTs) have emerged as a powerful alternative to Convolutional Neural Networks (CNNs) in various vision tasks. ViTs process images as sequences of patches and capture long-range dependencies through Multi-Head Self-Attention (MHSA). Hybrid CNN-ViT architectures further enhance performance by integrating the local inductive bias of CNNs with the global contextual information of ViTs. However, the quadratic complexity of self-attention limits its efficiency as the number of tokens grows. Separable Self-Attention (SSA) in MobileViTv2 reduces this overhead by aggregating contextual information into a single vector and applying that vector to all tokens. Despite this improvement, SSA has limitations compared to MHSA: it extracts only a single level of features, and it does not allow individual tokens to selectively acquire relevant information. These shortcomings limit the performance of SSA. To address these issues, we propose MatchViT, a novel hybrid CNN-ViT model. MatchViT introduces Matching Separable Self-Attention (MaSSA), which employs multi-head processing and a matching mechanism so that each token can individually gather information across hidden tokens. Moreover, context-gated FFNs in MatchViT leverage the information gathered by MaSSA for further gains. By adopting MaSSA and context-gated FFNs, MatchViT achieves a 1%–3% accuracy improvement on image classification compared with various vision models at identical MACs. Further experimental results demonstrate that MatchViT overcomes the shortcomings of MobileViTv2, achieving superior accuracy at low computational cost across diverse vision tasks.
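To make the baseline the abstract refers to concrete, below is a minimal sketch of separable self-attention as described there: contextual information is pooled into a single context vector and broadcast to every token, giving complexity linear in the number of tokens. This is an illustrative sketch only; the module name, layer names, and details are assumptions for exposition and are not the paper's MaSSA or the authors' exact implementation.

```python
import torch
import torch.nn as nn


class SeparableSelfAttention(nn.Module):
    """Sketch of SSA: one context vector summarizes all tokens, then is
    applied (broadcast) back to every token. Linear in sequence length."""

    def __init__(self, dim: int):
        super().__init__()
        self.to_scores = nn.Linear(dim, 1)   # one attention score per token
        self.to_key = nn.Linear(dim, dim)
        self.to_value = nn.Linear(dim, dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_tokens, dim)
        scores = self.to_scores(x).softmax(dim=1)                      # (B, N, 1)
        context = (scores * self.to_key(x)).sum(dim=1, keepdim=True)  # (B, 1, dim)
        out = torch.relu(self.to_value(x)) * context                   # broadcast to all tokens
        return self.proj(out)


if __name__ == "__main__":
    x = torch.randn(2, 196, 64)                     # e.g. 14x14 patch tokens, dim 64
    print(SeparableSelfAttention(64)(x).shape)      # torch.Size([2, 196, 64])
```

Because every token is modulated by the same context vector, this design extracts a single level of features and gives tokens no way to select information individually; those are exactly the two limitations the proposed MaSSA (multi-head processing plus a matching mechanism) is stated to address.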
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 16361