TPI-LLM: Serving 70B-scale LLMs Efficiently on Low-resource Edge Devices

Zonghang Li; Wenjiao Feng; Mohsen Guizani; Hongfang Yu

27 Sept 2024 (modified: 13 May 2025), ICLR 2025 Conference Withdrawn Submission, CC BY 4.0

Keywords: DML Systems, Edge LLM Serving, Tensor Parallelism, Memory Scheduling

TL;DR: This work can serve 70B-scale LLMs efficiently using multiple edge devices with limited computing power, memory, and bandwidth.

Abstract: Large model inference is shifting from cloud to edge due to concerns about the privacy of user interaction data. However, edge devices often have limited computing power, memory, and bandwidth, so multiple devices must collaborate to run and speed up LLM inference. Pipeline parallelism, the mainstream solution, is inefficient for single-user scenarios, while tensor parallelism suffers from frequent communication. In this paper, we argue that tensor parallelism can be more effective than pipeline parallelism on low-resource devices, and present a compute- and memory-efficient tensor parallel inference system, named TPI-LLM, to serve 70B-scale models. TPI-LLM keeps sensitive raw data local on the users' devices and introduces a sliding window memory scheduler that dynamically manages layer weights during inference, overlapping disk I/O latency with computation and communication. This allows larger models to run smoothly on memory-limited devices. We analyze the communication bottleneck and find that link latency, not bandwidth, is the main issue, so we implement a star-based allreduce algorithm. In extensive experiments on both emulated and real testbeds, TPI-LLM reduces time-to-first-token and token latency by over 80% compared to Accelerate and by over 90% compared to Transformers and Galaxy, while cutting the peak memory footprint of Llama 2-70B by 90%, requiring only 3.1 GB of memory for 70B-scale models.
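The claim that link latency, rather than bandwidth, favors a star-based allreduce can be illustrated with a rough cost model. The sketch below is not the paper's analysis: it assumes the textbook ring allreduce (2(n-1) sequential steps, each moving 1/n of the payload) and a star in which every worker has its own link to a hub, so that the gather and broadcast phases each cost one hop. The function names and the latency and bandwidth values are hypothetical; the payload size assumes one token's hidden state for Llama 2-70B (hidden size 8192) stored as 4-byte floats.

```python
def ring_allreduce_time(n, size_bytes, latency_s, bandwidth_bps):
    """Textbook ring allreduce: 2*(n-1) sequential steps, each sending size/n bytes."""
    steps = 2 * (n - 1)
    per_step_transfer = (size_bytes / n) * 8 / bandwidth_bps
    return steps * (latency_s + per_step_transfer)


def star_allreduce_time(n, size_bytes, latency_s, bandwidth_bps):
    """Star allreduce: workers send to a hub, the hub reduces and broadcasts.

    Assumption: each worker has a dedicated link to the hub, so the n-1
    transfers in each phase overlap and only 2 sequential hops are paid.
    """
    per_hop_transfer = size_bytes * 8 / bandwidth_bps
    return 2 * (latency_s + per_hop_transfer)


if __name__ == "__main__":
    n = 8                      # hypothetical number of edge devices
    payload = 8192 * 4         # one hidden-state vector, 4-byte floats (assumed)
    bandwidth = 300e6          # hypothetical 300 Mbit/s links

    for latency in (0.0, 0.001, 0.005):  # hypothetical link latencies in seconds
        ring = ring_allreduce_time(n, payload, latency, bandwidth)
        star = star_allreduce_time(n, payload, latency, bandwidth)
        print(f"latency={latency * 1e3:.0f} ms  ring={ring * 1e3:.2f} ms  star={star * 1e3:.2f} ms")
```

Under these assumptions the ring moves fewer bytes per device and wins when latency is negligible, but it pays the link latency 2(n-1) times, so the star wins once per-hop latency dominates the small per-token payload.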

Primary Area: infrastructure, software libraries, hardware, systems, etc.

Submission Number: 8847
