Keywords: multi-device inference, communication-efficient transformers, vector quantization
Abstract: Transformer models power many AI applications but suffer from high inference latency, limiting their use in real-time settings. Multi-device inference can reduce latency by parallelizing computation, yet existing methods require high inter-device bandwidth, making them impractical for bandwidth-constrained environments. We propose ASTRA, a communication-efficient framework that accelerates Transformer inference through a novel integration of sequence parallelism and a Mixed-Precision Attention mechanism designed to minimize inter-device communication. ASTRA compresses non-local token embeddings via vector quantization and preserves task accuracy through two optimizations, Noise-Augmented Quantization and Distributed Class Tokens. Experiments on ViT and GPT2 across vision and NLP tasks show that ASTRA achieves up to 2.64$\times$ speedups over single-device inference and up to 15.25$\times$ speedups over state-of-the-art multi-device inference methods, while operating under bandwidths as low as 10 Mbps.
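To make the communication-compression idea concrete, below is a minimal sketch (not the authors' implementation) of vector-quantizing token embeddings so that only codebook indices, rather than full-precision vectors, cross the device boundary. The codebook size (K=256, i.e., one byte per token), the ViT-Base embedding shape, and the nearest-codeword encoding are all illustrative assumptions; the abstract does not specify ASTRA's codebook design, its training procedure, or how Noise-Augmented Quantization and Distributed Class Tokens are realized.

```python
# Hedged sketch of vector quantization for inter-device communication.
# Assumption: a learned codebook of K=256 codewords, so each non-local
# token embedding is transmitted as a single uint8 index.
import numpy as np

def vq_encode(x, codebook):
    """Map each token embedding to the index of its nearest codeword.

    x:        (n_tokens, d) float32 token embeddings
    codebook: (K, d) float32 codewords (K <= 256 here)
    returns:  (n_tokens,) uint8 indices -- all that crosses the wire
    """
    # Squared Euclidean distance between every token and every codeword.
    d2 = ((x[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1).astype(np.uint8)

def vq_decode(idx, codebook):
    """Reconstruct approximate embeddings on the receiving device."""
    return codebook[idx]

rng = np.random.default_rng(0)
tokens = rng.standard_normal((197, 768)).astype(np.float32)    # e.g. ViT-Base sequence
codebook = rng.standard_normal((256, 768)).astype(np.float32)  # illustrative codebook

idx = vq_encode(tokens, codebook)    # 197 bytes transmitted
recon = vq_decode(idx, codebook)     # approximate embeddings on the peer device
print(idx.nbytes, tokens.nbytes)     # 197 vs 605184 bytes (~3072x smaller)
```

Under these assumptions, each fp32 embedding (768 floats, 3072 bytes) is replaced by one byte on the wire, which is the kind of reduction that makes 10 Mbps links viable; the accuracy cost of this lossy step is what the paper's Noise-Augmented Quantization and Distributed Class Tokens optimizations are said to address.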
Primary Area: infrastructure, software libraries, hardware, systems, etc.
Submission Number: 16127