Abstract: Despite their impressive performance on various tasks, vision transformers (ViTs) are heavy for mobile vision applications. Recent works have proposed combining the strengths of ViTs and convolutional neural networks (CNNs) to build lightweight networks. Still, these approaches rely on hand-designed architectures with a pre-determined number of parameters. This requires re-running the training process to obtain different lightweight ViTs under different resource constraints.
In this work, we use neural architecture search to find optimal lightweight ViTs under constraints on model size and computational cost. With the proposed method, we first train a supernet, a hybrid architecture of CNNs and transformers. To efficiently search for the optimal architecture, we use a search algorithm that considers both model parameters and on-device deployment latency. This method analyzes network properties, hardware memory access patterns, and the degree of parallelism to directly and accurately estimate network latency. To avoid extensive on-device testing during the search, we build a lookup table from a detailed breakdown of the speed of each component and operation; this table can be reused to evaluate the overall latency of each candidate architecture. Our approach is more efficient than benchmarking the speed of the whole model during the search process.
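The lookup-table idea above can be illustrated with a minimal sketch: per-operation latencies are measured once on the target device, and a candidate architecture's latency is estimated by summing table entries instead of benchmarking the whole model. The operation names, shapes, and latency values below are hypothetical placeholders, not measurements from the paper.

```python
# Hypothetical pre-measured per-operation latencies (ms) on the target device,
# keyed by (op type, input resolution, channel count).
latency_lut = {
    ("conv3x3", 112, 16): 0.42,
    ("mbconv", 56, 32): 0.65,
    ("transformer_block", 14, 96): 1.10,
    ("pool_fc", 7, 320): 0.08,
}

def estimate_latency(architecture):
    """Estimate a candidate's latency by summing per-op table entries,
    avoiding an on-device benchmark of the whole network."""
    return sum(latency_lut[op] for op in architecture)

candidate = [
    ("conv3x3", 112, 16),
    ("mbconv", 56, 32),
    ("transformer_block", 14, 96),
    ("pool_fc", 7, 320),
]
print(round(estimate_latency(candidate), 2))  # total estimated latency in ms
```

Because the table is built once and reused across all candidates, the search pays the measurement cost only up front rather than per evaluated architecture.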
With extensive experiments on ImageNet, we demonstrate that under similar parameters and FLOPs, our searched lightweight ViTs achieve higher accuracy and lower latency than state-of-the-art models. For example, our AutoViT\_XXS (71.3\% Top-1 accuracy and 10.2ms latency) has 2.3\% higher accuracy and 4.5ms lower latency than MobileViT\_XXS (69.0\% Top-1 accuracy and 14.7ms latency).
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: We revised the paper according to the reviewers' valuable comments. Changes are highlighted in blue text.
Assigned Action Editor: ~Zhiding_Yu1
Submission Number: 1409