MDP: Model Decomposition and Parallelization of Vision Transformer for Distributed Edge Inference

Published: 01 Jan 2023 · Last Modified: 07 Aug 2024 · MSN 2023 · CC BY-SA 4.0
Abstract: Distributed edge inference is emerging as a promising paradigm to speed up inference. Previous works physically partition CNNs to realize it, but vision transformers pose the following challenges: (1) high communication costs due to the large model; (2) stragglers caused by heterogeneous devices; (3) time-out exceptions due to unstable edge devices. Therefore, we propose Model Decomposition and Parallelization (MDP), a novel scheme for large vision transformers. Inspired by the implicit boosting ensemble inside a vision transformer, MDP decomposes the model into an explicit boosting ensemble of different, parallel sub-models and trains the sub-models sequentially so that each one gradually reduces the residual errors of those before it. To minimize dependency and communication among sub-models, we adopt stacking distillation, which gives every sub-model extra information about the others for better error correction. Different sub-models can take both different image sizes and different model sizes, so they can run on heterogeneous devices and improve ensemble diversity. To handle time-out exceptions, we add vanilla supervised learning on every sub-model, forming a bagging ensemble that remains usable if the boosting ensemble terminates early. As a result, all sub-models not only run in parallel with little communication but also adapt to heterogeneous devices, while maintaining accuracy even under time-out exceptions. Experiments show that MDP outperforms other baselines by $5.2\times\sim2.1\times$ in latency and $5.1\times\sim1.7\times$ in throughput with comparable accuracy.
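The abstract does not include code, but the core idea of the training scheme can be sketched as follows: each sub-model is fitted sequentially to the residual error of the frozen ensemble built so far (boosting), while an additional per-sub-model supervised loss keeps any prefix of the ensemble usable on its own (bagging fallback for time-outs). The sketch below is an illustration under assumptions only; the sub-model architecture (`TinyViT` placeholder), the residual loss, and the loss weight `alpha` are hypothetical choices, not the authors' implementation, and the stacking-distillation step is omitted.

```python
# Minimal sketch of MDP-style boosting-ensemble training (assumptions, not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyViT(nn.Module):
    """Placeholder sub-model; MDP would use vision transformers of different sizes."""
    def __init__(self, img_size: int, dim: int, num_classes: int):
        super().__init__()
        self.img_size = img_size
        self.backbone = nn.Sequential(
            nn.Flatten(),
            nn.Linear(3 * img_size * img_size, dim),
            nn.ReLU(),
            nn.Linear(dim, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Each sub-model may take a different input resolution (heterogeneous devices).
        x = F.interpolate(x, size=self.img_size, mode="bilinear", align_corners=False)
        return self.backbone(x)


def train_boosting_ensemble(sub_models, loader, num_classes, lr=1e-3, alpha=0.5):
    """Sequentially fit each sub-model to the residual of the frozen ensemble so far,
    and add a standalone supervised loss so early termination still yields a prediction."""
    trained = []
    for model in sub_models:
        opt = torch.optim.Adam(model.parameters(), lr=lr)
        for images, labels in loader:
            targets = F.one_hot(labels, num_classes).float()
            with torch.no_grad():
                # Current (frozen) ensemble prediction: sum of earlier sub-model outputs.
                ensemble = sum(m(images) for m in trained) if trained else torch.zeros_like(targets)
            logits = model(images)
            residual_loss = F.mse_loss(ensemble + logits, targets)   # boosting: fit the residual
            bagging_loss = F.cross_entropy(logits, labels)           # vanilla supervision
            loss = residual_loss + alpha * bagging_loss
            opt.zero_grad()
            loss.backward()
            opt.step()
        trained.append(model.eval())
    return trained
```

At inference time, each sub-model would run on its own device and the logits would simply be summed, so losing a straggler's contribution degrades the prediction gracefully rather than breaking it.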