Abstract: Despite their impressive performance on various tasks, vision transformers (ViTs) are too heavy for mobile vision applications. Recent works have proposed combining the strengths of ViTs and convolutional neural networks (CNNs) to build lightweight networks, but these approaches rely on hand-designed architectures with a pre-determined number of parameters. In this work, we use neural architecture search to find optimal lightweight ViTs under constraints on model size and computational cost. Our search algorithm considers both model parameters and on-device deployment latency: it analyzes network properties, hardware memory access patterns, and the degree of parallelism to estimate network latency directly and accurately. To avoid extensive on-device testing during the search, we build a lookup table from a detailed breakdown of the speed of each component and operation; the table is reused to evaluate the latency of every searched structure, which is more efficient than benchmarking the whole model for each candidate. Extensive experiments demonstrate that, under similar parameters and FLOPs, our searched lightweight ViTs achieve higher accuracy and lower latency than state-of-the-art models. For instance, on ImageNet-1K, AutoViT_XXS (71.3% Top-1 accuracy, 10.2 ms latency) outperforms MobileViTv3_XXS (71.0% Top-1 accuracy, 12.5 ms latency) with 0.3% higher accuracy and 2.3 ms lower latency.
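To illustrate the lookup-table idea described above, the following is a minimal Python sketch, not the authors' implementation: per-operator latencies are measured once and stored under a hypothetical key (operator type, resolution, input/output channels), and a candidate architecture's latency is then estimated by summing the table entries instead of benchmarking the whole model on device. All class names, keys, and numbers here are illustrative assumptions.

```python
from typing import Dict, List, Tuple

# Hypothetical key: (op_type, input_resolution, channels_in, channels_out)
LatencyKey = Tuple[str, int, int, int]


class LatencyLookupTable:
    """Maps each operator configuration to a pre-measured on-device latency."""

    def __init__(self) -> None:
        self.table: Dict[LatencyKey, float] = {}

    def record(self, key: LatencyKey, latency_ms: float) -> None:
        # Measured once per unique operator configuration, then reused
        # across all candidate architectures in the search.
        self.table[key] = latency_ms

    def estimate(self, architecture: List[LatencyKey]) -> float:
        # Whole-network latency is approximated as the sum of per-op entries,
        # avoiding a full on-device benchmark for every search candidate.
        return sum(self.table[op] for op in architecture)


# Example: two hypothetical blocks of a searched lightweight ViT.
lut = LatencyLookupTable()
lut.record(("conv3x3", 56, 32, 64), 0.42)   # convolutional stem block
lut.record(("mhsa", 14, 128, 128), 1.10)    # multi-head self-attention block

candidate = [("conv3x3", 56, 32, 64), ("mhsa", 14, 128, 128)]
print(f"Estimated latency: {lut.estimate(candidate):.2f} ms")
```

In this sketch the table amortizes measurement cost: each unique operator configuration is benchmarked once, and every subsequent candidate is scored in constant time per layer.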