OpenDiLoCo: An Open-Source Framework for Globally Distributed Low-Communication Training

Published: 21 Jun 2024, Last Modified: 26 Jul 2024, ES-FoMo-II 2024 Poster, CC BY 4.0
Keywords: Distributed Training, Decentralized AI, Local-SGD, DiLoCo
TL;DR: OpenDiLoCo is an open-source implementation and replication of the Distributed Low-Communication (DiLoCo) training method for large language models.
Abstract: OpenDiLoCo is an open-source implementation and replication of the Distributed Low-Communication (DiLoCo) training method for large language models. We provide a reproducible implementation of the DiLoCo experiments, offering it within a scalable, decentralized training framework using the Hivemind library. We demonstrate its effectiveness by training a model across two continents and four countries. Additionally, we conduct an analytical evaluation of its practicality, focusing on the algorithm's compute efficiency and its scalability with the number of workers. Our findings indicate that while DiLoCo can be effective in specific scenarios, it is not necessarily a drop-in low-communication replacement for Distributed Data Parallel training, owing to its lower compute efficiency when training for a smaller number of steps.
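To make the training scheme referenced above concrete, the following is a minimal single-process sketch of a DiLoCo-style inner/outer loop: each worker takes many local optimizer steps, and workers synchronize only infrequently by averaging pseudo-gradients that an outer optimizer applies to the global parameters. The worker count, step counts, learning rates, and the toy model and data here are illustrative assumptions, not the paper's configuration, and the in-process loop over workers stands in for the actual decentralized communication (which OpenDiLoCo performs via Hivemind).

```python
# Sketch of a DiLoCo-style inner/outer loop. All hyperparameters and the
# toy model below are assumptions for illustration only.
import copy
import torch

NUM_WORKERS = 4    # assumed number of workers
INNER_STEPS = 100  # assumed local steps between synchronizations
OUTER_STEPS = 10   # assumed number of outer synchronization rounds

def make_model():
    return torch.nn.Linear(32, 1)  # toy stand-in for a language model

global_model = make_model()
# Outer optimizer applies the averaged pseudo-gradients to the global weights.
outer_opt = torch.optim.SGD(global_model.parameters(), lr=0.7,
                            momentum=0.9, nesterov=True)

for outer in range(OUTER_STEPS):
    pseudo_grads = [torch.zeros_like(p) for p in global_model.parameters()]
    for w in range(NUM_WORKERS):
        # Each worker starts the round from the current global parameters.
        local = copy.deepcopy(global_model)
        inner_opt = torch.optim.AdamW(local.parameters(), lr=1e-3)
        for _ in range(INNER_STEPS):
            x = torch.randn(16, 32)           # toy data shard
            loss = local(x).pow(2).mean()     # toy objective
            inner_opt.zero_grad()
            loss.backward()
            inner_opt.step()
        # Pseudo-gradient: global parameters minus locally updated ones.
        # In a real deployment this average is computed by an all-reduce
        # across workers rather than a loop in one process.
        for acc, gp, lp in zip(pseudo_grads, global_model.parameters(),
                               local.parameters()):
            acc += (gp.detach() - lp.detach()) / NUM_WORKERS
    # Apply the averaged pseudo-gradient with the outer optimizer.
    outer_opt.zero_grad()
    for p, g in zip(global_model.parameters(), pseudo_grads):
        p.grad = g
    outer_opt.step()
    print(f"outer step {outer}: synchronized {NUM_WORKERS} workers")
```

Because the workers exchange parameters only once every `INNER_STEPS` local steps, communication volume drops by roughly that factor relative to Distributed Data Parallel, which is the trade-off against per-step compute efficiency that the abstract's findings address.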
Submission Number: 63