Distributed Inference and Fine-tuning of Large Language Models Over The Internet

Published: 01 Feb 2023 · Last Modified: 13 Feb 2023 · Submitted to ICLR 2023 · Readers: Everyone
Keywords: volunteer computing, distributed deep learning, distributed inference, efficient inference, large language models, gpt-3
TL;DR: We propose a practical algorithm for running large language models by pooling together weak, geographically distributed devices. Our system can run inference for BLOOM-176B over the Internet more than 10x faster than RAM offloading.
Abstract: Large language models (LLMs) are useful in many NLP tasks and become more capable with size, scaling to over 100 billion parameters. With the release of BLOOM-176B and OPT-175B, everyone can download pretrained models of this scale. Still, using a pretrained 100B+ model requires high-end hardware, making it inaccessible to most researchers. Recent studies in memory-efficient training (e.g. offloading) could alleviate these costs, but they do not cover important use cases of LLMs, such as autoregressive inference. In this work, we investigate methods for cost-efficient inference of large language models, comparing local and distributed strategies. We observe that a large enough model (100B+) can run efficiently on geodistributed devices in a consumer-grade network, for example by connecting existing compute resources of multiple research groups or pooling under-utilized compute from multiple cloud regions. To run LLMs in this unconventional setting, we develop a fault-tolerant algorithm for inference of language models. We propose Petals, a decentralized system for running LLMs, and show that it can run BLOOM-176B over the Internet more than $10\times$ faster than offloading for sequential generation. We evaluate the performance of our system in both simulated conditions and an actual distributed system spanning two continents. The design of Petals allows participants to run inference, fine-tune, or run inference on fine-tuned models simultaneously without affecting each other's results.
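To illustrate the setting the abstract describes, here is a minimal, hedged sketch (not the authors' implementation and not the Petals API) of a client that performs autoregressive inference by streaming hidden states through transformer blocks hosted on remote, unreliable servers and falling back to a backup server when one fails. The class `RemoteBlock`, its `forward` call, and the peer addresses are hypothetical placeholders.

```python
# Sketch: client-side pipeline over remote transformer blocks with failover.
# All server names and the RemoteBlock API below are hypothetical.
import random

import torch


class RemoteBlock:
    """Stub for a transformer block served by a remote peer (hypothetical API)."""

    def __init__(self, address: str, layer_index: int):
        self.address = address
        self.layer_index = layer_index

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # In a real system this would be an RPC; the server would also keep
        # the attention cache for this inference session.
        if random.random() < 0.05:  # simulate an unreliable volunteer node
            raise ConnectionError(f"server {self.address} went offline")
        return hidden_states  # placeholder for the block's actual computation


def run_blocks_with_failover(primaries, backups, hidden_states):
    """Try the primary server for each layer; fall back to a backup on failure."""
    for primary, backup in zip(primaries, backups):
        try:
            hidden_states = primary.forward(hidden_states)
        except ConnectionError:
            hidden_states = backup.forward(hidden_states)
    return hidden_states


# Toy usage: each generated token's hidden state is pushed through the chain.
hidden = torch.zeros(1, 1, 14336)  # batch=1, one new token, BLOOM-176B hidden size
primaries = [RemoteBlock(f"peer{i}.example", i) for i in range(4)]
backups = [RemoteBlock(f"backup{i}.example", i) for i in range(4)]
for step in range(8):  # a few autoregressive steps
    hidden = run_blocks_with_failover(primaries, backups, hidden)
```

In the actual system a failover would also need to restore the server-side attention cache for the interrupted session; this sketch omits that detail and only shows why sequential generation over remote blocks benefits from a fault-tolerant routing layer.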
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Submission Guidelines: Yes
Please Choose The Closest Area That Your Submission Falls Into: Applications (eg, speech processing, computer vision, NLP)