Keywords: Large Language Model, Pipeline Parallelism, KV Cache, Sampling
TL;DR: DynaPipe dynamically redistributes pipeline layers, using real-time latency prediction and asynchronous KV-cache migration to balance computation during LLM inference, reducing average request latency by 8% to 49% over state-of-the-art pipeline parallelism systems.
Abstract: To accelerate large language model (LLM) inference, pipeline parallelism partitions model layers into sequential stages, each assigned to a different device for concurrent execution. However, this approach often suffers from pipeline bubbles caused by imbalanced computation in the tail stage: while upstream stages perform only layer-forward operations, the final stage must also handle post-processing tasks such as sampling, introducing significant latency. This uneven workload misaligns the pipeline, forcing upstream stages to idle and degrading overall performance. Existing frameworks typically distribute layers evenly across stages without accounting for these differences in computational load. To address this, we propose DynaPipe, a dynamic layer redistribution scheme that adaptively balances computation by predicting execution latency in real time. Moreover, we introduce an asynchronous key-value (KV) cache migration coordinator that enables non-blocking layer redistribution during inference. Experiments on representative LLMs demonstrate that DynaPipe reduces average end-to-end request latency by 8% to 49% across diverse workloads, outperforming state-of-the-art pipeline parallelism systems.
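The abstract's two mechanisms can be illustrated with small sketches. First, a minimal sketch of latency-aware layer redistribution, assuming per-layer forward latencies and the tail stage's sampling overhead have already been predicted; the function `rebalance_stages` and its interface are hypothetical, not DynaPipe's actual API. It binary-searches the smallest per-stage time budget for which a contiguous partition of the layers exists, charging the sampling cost to the last stage:

```python
# Hypothetical sketch, not DynaPipe's actual implementation: choose a
# contiguous layer-to-stage partition that minimizes the slowest stage,
# where the last stage additionally pays the sampling/post-processing cost.
from itertools import accumulate

def rebalance_stages(layer_latency, num_stages, sampling_latency):
    """Return the number of layers per stage (contiguous partition)."""
    n = len(layer_latency)
    assert n >= num_stages, "need at least one layer per stage"
    prefix = [0.0] + list(accumulate(layer_latency))  # prefix sums

    def feasible(budget):
        # Greedily pack layers into stages without exceeding `budget`,
        # leaving at least one layer for every remaining stage.
        cuts, start = [], 0
        for s in range(num_stages):
            extra = sampling_latency if s == num_stages - 1 else 0.0
            limit = n - (num_stages - 1 - s)
            end = start
            while end < limit and prefix[end + 1] - prefix[start] + extra <= budget:
                end += 1
            if end == start:        # even a single layer exceeds the budget
                return None
            cuts.append(end - start)
            start = end
        return cuts if start == n else None

    # Binary-search the smallest feasible budget; `hi` is always feasible.
    lo, hi = max(layer_latency), prefix[-1] + sampling_latency
    best = feasible(hi)
    for _ in range(40):
        mid = (lo + hi) / 2
        plan = feasible(mid)
        if plan:
            best, hi = plan, mid
        else:
            lo = mid
    return best
```

For example, with eight equal-cost layers (1 ms each), two stages, and a 2 ms sampling overhead, the plan is [5, 3]: both stages then take 5 ms, whereas the conventional even split [4, 4] would load the tail stage with 6 ms and leave the first stage idle for 2 ms per microbatch.

Second, a sketch in the spirit of the abstract's asynchronous KV-cache migration coordinator: a background thread copies the KV blocks of reassigned layers to their new device while the pipeline keeps serving requests from the old placement, and stages switch over only once the copy completes. `copy_fn` is a hypothetical device-to-device copy callback:

```python
import queue
import threading

class KVMigrationCoordinator:
    """Copy KV-cache blocks for reassigned layers in the background."""

    def __init__(self, copy_fn):
        self.copy_fn = copy_fn            # hypothetical D2D copy callback
        self.tasks = queue.Queue()
        self.drained = threading.Event()  # set when all copies are done
        threading.Thread(target=self._worker, daemon=True).start()

    def submit(self, layer_ids, src_dev, dst_dev):
        """Schedule a non-blocking migration; returns immediately."""
        self.drained.clear()
        self.tasks.put((layer_ids, src_dev, dst_dev))

    def _worker(self):
        while True:
            layer_ids, src, dst = self.tasks.get()
            for lid in layer_ids:
                self.copy_fn(lid, src, dst)  # runs off the critical path
            if self.tasks.empty():
                self.drained.set()           # safe to adopt new placement
```

A real coordinator would additionally overlap copies with computation (e.g., on a separate CUDA stream) and reconcile KV entries written during the migration window; the sketch only conveys the non-blocking control flow.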
Supplementary Material: zip
Primary Area: Infrastructure (e.g., libraries, improved implementation and scalability, distributed solutions)
Submission Number: 11912