Ladder-Residual: Parallelism-Aware Architecture for Accelerating Large Model Inference with Communication Overlapping

Published: 01 May 2025, Last Modified: 18 Jun 2025. ICML 2025 poster. License: CC BY 4.0
TL;DR: An architecture modification that allows full overlapping of communication within Tensor Parallelism, yielding a 29% inference speedup at 70B scale when applied to the Transformer.
Abstract: Large language model inference is both memory-intensive and time-consuming, often requiring distributed algorithms to scale efficiently. Various model parallelism strategies are used in multi-GPU training and inference to partition computation across multiple devices, reducing memory load and computation time. However, model parallelism necessitates communication between GPUs, which has been a major bottleneck and limits the gains obtained by scaling up the number of devices. We introduce Ladder Residual, a simple architectural modification applicable to all residual-based models that enables straightforward overlapping to effectively hide the latency of communication. **Our insight is that in addition to systems optimization, one can also redesign the model architecture to decouple communication from computation.** While Ladder Residual can enable communication-computation decoupling in other conventional parallelism patterns, we focus on Tensor Parallelism in this paper, which is particularly bottlenecked by its heavy communication. For a Transformer model with 70B parameters, applying Ladder Residual to all of its layers achieves a 29% end-to-end wall-clock speedup at inference time with TP sharding over 8 devices. We refer to the resulting Transformer model as the Ladder Transformer. We train 1B and 3B Ladder Transformers from scratch and observe performance comparable to a standard dense Transformer baseline. We also show that it is possible to convert parts of the Llama-3.1 8B model to our Ladder Residual architecture with minimal accuracy degradation by retraining on only 3B tokens.
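To make the overlapping idea concrete, below is a minimal PyTorch-style sketch of one Ladder-Residual Transformer block under Tensor Parallelism. It is illustrative only, not the authors' implementation (see the linked repository for the real one); the function and variable names (`ladder_block`, `attn`, `mlp`, the handle/partial bookkeeping) are assumptions, and it presumes a tensor-parallel process group is already initialized.

```python
# Minimal sketch of the Ladder-Residual overlapping idea (illustrative only;
# see the linked repository for the actual implementation). Assumes
# torch.distributed is initialized with a tensor-parallel group, and that
# `attn` and `mlp` return partial outputs that must be summed across ranks.
import torch.distributed as dist


def ladder_block(prev_comm, residual, attn, mlp):
    """One Transformer block where each sub-module reads a residual stream
    that lags by one sub-module, so the previous all-reduce can overlap
    with the current computation.

    prev_comm: (Work handle, partial tensor) from the previous sub-module's
               asynchronous all-reduce, not yet added to the residual stream.
    residual:  the running residual stream.
    """
    # Attention starts from the residual stream without waiting for the
    # previous sub-module's all-reduce; its own reduction is launched async.
    attn_partial = attn(residual)
    attn_work = dist.all_reduce(attn_partial, async_op=True)

    # While attention's all-reduce is in flight, fold in the previous
    # sub-module's (by now reduced) output.
    prev_work, prev_partial = prev_comm
    prev_work.wait()
    residual = residual + prev_partial

    # The MLP likewise runs before attention's reduced output is available,
    # overlapping its computation with attention's communication.
    mlp_partial = mlp(residual)
    mlp_work = dist.all_reduce(mlp_partial, async_op=True)

    # Now consume attention's result; the MLP's pending all-reduce is handed
    # to the next block so it can overlap with that block's attention.
    attn_work.wait()
    residual = residual + attn_partial

    return (mlp_work, mlp_partial), residual
```

In a standard Transformer block, each residual add must wait for the preceding all-reduce before the next sub-module can start; in the sketch above the wait is deferred by one sub-module, which is what hides the communication latency.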
Lay Summary: As foundation models continue to scale, multi-GPU inference is crucial. Tensor Parallelism (TP), a widely adopted distributed inference approach, divides weights and computation across all devices, which helps with both memory efficiency and speed. However, the inter-GPU communication turns out to be a major bottleneck for overall latency: for a 70B model running with TP on 8 GPUs, communication can account for 38% of the total inference time. We introduce Ladder Residual, a simple architecture tweak that allows computation and communication to happen in parallel, reducing latency without needing custom kernels or hardware changes. Here's a quick summary of what Ladder Residual achieves:
* ~30% speedup for Llama 3.1-70B (TP=8) and Llama 3.1-405B (TP=16), and nearly double the speedup when a fast interconnect (NVLink) is not available, with performance comparable to a standard Transformer.
* Can be applied to a pretrained model: we adapt Llama 3.1-8B and gain a 23% speedup with no accuracy loss.
* A pure PyTorch-level modification: no custom CUDA kernels needed, and it works on any hardware.
Link To Code: https://github.com/mayank31398/ladder-residual-inference/tree/main
Primary Area: Deep Learning->Large Language Models
Keywords: Language Model, Inference, Distributed Inference, Architecture, Efficiency, Parallelism
Submission Number: 7644