Co-Designing Transformer Architectures for Distributed Inference With Low Communication

Jiangsu Du, Yuanxin Wei, Shengyuan Ye, Jiazhi Jiang, Xu Chen, Dan Huang, Yutong Lu

Published: 01 Apr 2025 · Last Modified: 05 Nov 2025 · IEEE Transactions on Parallel and Distributed Systems · CC BY-SA 4.0
Abstract: Transformer models have shown significant success in a wide range of tasks. However, the massive resources required for their inference prevent deployment on a single device with relatively constrained resources, leaving a high threshold for integrating their advancements. Observing scenarios such as smart home applications on edge devices and cloud deployment on commodity hardware, it is promising to distribute Transformer inference across multiple devices. Unfortunately, due to the tightly coupled nature of the Transformer architecture, existing model parallelism approaches necessitate frequent communication to resolve data dependencies, making them unacceptable for distributed inference, especially under relatively weak interconnection. In this paper, we propose DeTransformer, a communication-efficient distributed Transformer inference system. The key idea of DeTransformer is to co-design the Transformer architecture to reduce communication during distributed inference. In detail, DeTransformer is based on a novel block parallelism approach, which restructures the original Transformer layer, consisting of a single block, into a decoupled layer with multiple sub-blocks, so that model parallelism can be exploited between sub-blocks. Next, DeTransformer contains an adaptive execution approach that strikes a trade-off among communication capability, computing power, and memory budget across multiple devices. It incorporates two-phase planning for execution, namely static planning and runtime planning. Static planning runs offline and comprises a profiling procedure and a weight placement strategy performed before execution. Runtime planning dynamically determines the optimal parallel computing strategy from a carefully crafted search space based on real-time requests. Notably, this execution approach can adapt to heterogeneous devices by distributing workload according to each device's computing capability. We conduct experiments on both auto-regressive and auto-encoder tasks of Transformer models. Experimental results show that DeTransformer can reduce distributed inference latency by up to 2.81× compared to the SOTA approach on 4 devices, while effectively maintaining task accuracy and a consistent model size.
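Below is a minimal, single-process sketch of the block-parallel idea described in the abstract, under our own assumptions rather than the paper's exact design: the original tightly coupled layer is replaced by several independent sub-blocks whose outputs are merged once per layer, so a distributed deployment would need only one collective per layer instead of per-operator synchronization. Names such as `SubBlock` and `DecoupledLayer`, the merge-by-averaging step, and all dimensions are illustrative, not taken from DeTransformer.

```python
# Hypothetical sketch of block parallelism (not the authors' implementation).
# Each sub-block is self-contained; in a distributed setting each would run
# on its own device and the final merge would be a single all-reduce per layer.
import torch
import torch.nn as nn


class SubBlock(nn.Module):
    """One independent sub-block with its own attention and FFN,
    no intra-layer dependency on the other sub-blocks."""

    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q = self.norm1(x)
        h = x + self.attn(q, q, q)[0]
        return h + self.ffn(self.norm2(h))


class DecoupledLayer(nn.Module):
    """A decoupled layer built from independent sub-blocks; their outputs
    are averaged, so only one communication step per layer is required."""

    def __init__(self, n_blocks: int, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.sub_blocks = nn.ModuleList(
            SubBlock(d_model, n_heads, d_ff) for _ in range(n_blocks)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Locally: every sub-block reads the same layer input.
        # Distributed: replace the loop with per-device execution and the
        # mean with a single all-reduce across devices.
        outputs = [blk(x) for blk in self.sub_blocks]
        return torch.stack(outputs).mean(dim=0)


if __name__ == "__main__":
    layer = DecoupledLayer(n_blocks=4, d_model=256, n_heads=4, d_ff=1024)
    tokens = torch.randn(2, 16, 256)   # (batch, sequence, hidden)
    print(layer(tokens).shape)         # torch.Size([2, 16, 256])
```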