Keywords: Efficient image classification, Efficient shape classification, Transformers, Early exit
TL;DR: We enable image and shape classification with multiple orders of magnitude fewer MACs, through a novel early-exit Vision Transformer coupled with a shallower ResNet model for simpler data instances.
Abstract: Efficient computation for Vision Transformers (ViTs) is critical for latency-sensitive applications. However, existing early-exit schemes rely on auxiliary controllers that introduce non-trivial overhead. We propose UWYN, an end-to-end framework for image and shape classification that embeds exit decisions directly within the transformer by reusing the classification head at each residual block. UWYN first partitions inputs via a lightweight feature threshold into “simple” and “complex” samples: simple samples are routed to a shallow ResNet branch, while complex samples traverse the ViT and terminate as soon as their per-block confidence exceeds a preset threshold. During the ViT pass, UWYN also dynamically prunes redundant patch embeddings and attention heads to further cut computation. We implement and evaluate this strategy on both 2D (ImageNet, CIFAR-10, CIFAR-100, SVHN, BloodMNIST) and 3D (ModelNet-40, ScanObjectNN) benchmarks. UWYN reduces multiply-accumulate operations (MACs) by over 75% compared to SOTA models such as LGViT [ACM MM ’23], achieving 83.29% accuracy on CIFAR-100 and 84.39% on ImageNet. We also show faster inference with minimal accuracy loss.
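To make the exit mechanism in the abstract concrete, below is a minimal PyTorch sketch of a ViT-style encoder that reuses a single classification head after every residual block and exits once the softmax confidence of the running prediction exceeds a preset threshold. All names (`EarlyExitViT`, `confidence_threshold`, the default dimensions) are illustrative assumptions, not the authors' actual implementation, and the routing to the ResNet branch and the patch/head pruning are omitted.

```python
# Hypothetical sketch of confidence-based early exit with a shared head;
# not the paper's code. Only the exit loop from the abstract is modeled.
import torch
import torch.nn as nn


class EarlyExitViT(nn.Module):
    """Toy ViT encoder: one shared classification head is applied after
    every block, and inference stops at the first confident prediction."""

    def __init__(self, dim=192, depth=12, num_heads=3, num_classes=100,
                 confidence_threshold=0.9):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads,
                                       batch_first=True)
            for _ in range(depth)
        )
        # One shared head reused at every exit point, so no auxiliary
        # per-exit controllers are needed.
        self.head = nn.Linear(dim, num_classes)
        self.threshold = confidence_threshold

    def forward(self, tokens):
        # tokens: (batch=1, seq_len, dim); CLS token assumed at index 0.
        for i, block in enumerate(self.blocks):
            tokens = block(tokens)
            logits = self.head(tokens[:, 0])           # classify CLS token
            confidence = logits.softmax(dim=-1).max()  # top-1 probability
            if confidence >= self.threshold:           # early exit
                return logits, i + 1                   # blocks actually run
        return logits, len(self.blocks)                # fell through: full depth


# Usage: a single "complex" sample traversing the ViT branch.
model = EarlyExitViT().eval()
with torch.no_grad():
    x = torch.randn(1, 197, 192)                       # CLS + patch embeddings
    logits, blocks_used = model(x)
print(f"exited after {blocks_used} blocks")
```

Under this reading, the saved computation is simply the transformer blocks that are never executed; the only per-block overhead is one linear head and a softmax.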
Supplementary Material: pdf
Primary Area: other topics in machine learning (i.e., none of the above)
Submission Number: 13854