SPD: Sync-Point Drop for Efficient Tensor Parallelism of Large Language Models

Published: 01 May 2025, Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
Abstract: With the rapid expansion in the scale of large language models (LLMs), enabling efficient distributed inference across multiple computing units has become increasingly critical. However, communication overheads from popular distributed inference techniques such as Tensor Parallelism pose a significant challenge to achieving scalability and low latency. Therefore, we introduce a novel optimization technique, Sync-Point Drop (SPD), which reduces communication overheads in tensor parallelism by selectively dropping synchronization on attention outputs. Specifically, we first propose a block design that allows execution to proceed through SPD without communication. Second, we apply different SPD strategies to attention blocks based on their sensitivity to model accuracy. The proposed methods effectively alleviate communication bottlenecks while minimizing accuracy degradation during LLM inference, offering a scalable solution for diverse distributed environments: SPD achieves about a 20\% reduction in overall inference latency with < 1\% accuracy regression for LLaMA2-70B inference over 8 GPUs.
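To make the idea concrete, below is a minimal, hypothetical sketch of the sync point being dropped in a tensor-parallel decoder block: in standard tensor parallelism each block all-reduces the attention output before the MLP, and here a per-block flag skips that all-reduce so each rank continues on its local partial result. The class and argument names (`TPDecoderBlock`, `drop_attn_sync`, and the `attn`/`mlp` shard modules) are illustrative assumptions, not the paper's actual block design, which additionally modifies the block to preserve accuracy when the sync is dropped.

```python
# Illustrative sketch only; assumes torch.distributed is already initialized
# and that `attn` and `mlp` are the rank-local shards of a row-parallel
# attention output projection and MLP, as in standard tensor parallelism.
import torch
import torch.distributed as dist
import torch.nn as nn


class TPDecoderBlock(nn.Module):
    def __init__(self, hidden, attn, mlp, drop_attn_sync: bool = False):
        super().__init__()
        self.ln1 = nn.LayerNorm(hidden)
        self.ln2 = nn.LayerNorm(hidden)
        self.attn = attn                  # rank-local attention shard
        self.mlp = mlp                    # rank-local MLP shard
        self.drop_attn_sync = drop_attn_sync

    def forward(self, x):
        # Attention: each rank produces a partial output. Standard TP
        # all-reduces here -- this is the sync point SPD selectively drops.
        attn_partial = self.attn(self.ln1(x))
        if not self.drop_attn_sync:
            dist.all_reduce(attn_partial, op=dist.ReduceOp.SUM)
        h = x + attn_partial              # with SPD, residual uses the local partial

        # MLP: its trailing all-reduce is kept, so ranks still synchronize
        # once per block instead of twice.
        mlp_partial = self.mlp(self.ln2(h))
        dist.all_reduce(mlp_partial, op=dist.ReduceOp.SUM)
        return h + mlp_partial
```

In this sketch, `drop_attn_sync` would be enabled only for blocks identified as insensitive, mirroring the abstract's per-block sensitivity-based strategy; how sensitivity is measured and how the block is redesigned to tolerate the missing reduction are described in the paper itself, not here.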
Lay Summary: Serving Large Language Models (LLMs) requires distributing computation across multiple GPUs to handle their size and complexity efficiently. However, this distribution introduces delays due to frequent synchronization between devices during model execution. We developed a technique called Sync-Point Drop (SPD) that selectively removes unnecessary synchronization steps while running the model. This significantly reduces delay and speeds up response generation. SPD delivers these improvements without notable loss in model accuracy and without requiring any hardware changes. Our approach enables faster and more efficient deployment of large-scale AI systems, making them more practical and cost-effective in real-world applications.
Primary Area: Optimization
Keywords: sync point drop, tensor parallelism, distributed inference, model optimization
Submission Number: 7553