Keywords: vision transformer, structural reparameterization, self-attention, latency, FLOPs
Abstract: Vision Transformers (ViTs) achieve remarkable performance on image classification tasks but suffer from computational inefficiency due to their deep architectures. While existing approaches focus on token reduction or attention optimization, the fundamental challenge of reducing architectural depth while maintaining representation capacity remains largely unaddressed. We propose a novel structural reparameterization approach that enables training of parallel-branch transformer architectures that can be collapsed into efficient single-branch networks during inference. During training, our method progressively joins parallel branches at the inputs of non-linear functions, which allows both the multi-head self-attention (MHSA) and feed-forward network (FFN) modules to be reparameterized at inference without approximation loss. When applied to DeiT-Tiny, our approach compresses the model from 12 layers to as few as 3, 4, or 6 layers while preserving accuracy, delivering up to 37% lower inference latency on mobile CPUs for ImageNet-1K classification. Our findings challenge the conventional wisdom that transformer depth is essential for strong performance, opening new directions for efficient ViT design.
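The key property behind the abstract's claim of lossless collapse is that branches merged before a non-linearity are purely linear, and a sum of linear maps is itself a linear map. Below is a minimal PyTorch sketch of that idea, not the authors' code: the layer widths, branch count, and restriction to an FFN input projection are assumptions chosen for illustration.

```python
# Minimal sketch (illustrative, not the paper's implementation): two parallel
# linear branches whose outputs are summed *before* a GELU can be folded into a
# single linear layer at inference, with identical outputs up to float rounding.
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, d_hidden = 192, 768  # DeiT-Tiny-like widths, assumed for illustration

# Training-time structure: two parallel branches joined at the non-linearity's input.
branch_a = nn.Linear(d_model, d_hidden)
branch_b = nn.Linear(d_model, d_hidden)

def train_time_ffn_in(x):
    return F.gelu(branch_a(x) + branch_b(x))

# Inference-time structure: one linear layer with summed weights and biases,
# since (W_a x + b_a) + (W_b x + b_b) = (W_a + W_b) x + (b_a + b_b).
merged = nn.Linear(d_model, d_hidden)
with torch.no_grad():
    merged.weight.copy_(branch_a.weight + branch_b.weight)
    merged.bias.copy_(branch_a.bias + branch_b.bias)

def inference_time_ffn_in(x):
    return F.gelu(merged(x))

x = torch.randn(4, d_model)
assert torch.allclose(train_time_ffn_in(x), inference_time_ffn_in(x), atol=1e-6)
```

The same reasoning extends to any branch that stays linear up to the merge point; how the paper handles the MHSA projections and the progressive joining schedule is described in the full submission rather than in this sketch.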
Supplementary Material: zip
Primary Area: other topics in machine learning (i.e., none of the above)
Submission Number: 5272