EXO Gym: a simulation environment for low-bandwidth training

Published: 09 Jun 2025, Last Modified: 14 Jul 2025 (CODEML@ICML25, CC BY 4.0)
Keywords: machine learning, distributed training
TL;DR: EXO Gym lets you test and benchmark low-bandwidth distributed training strategies, such as DiLoCo or SPARTA, on a single machine by simulating many virtual workers, removing the need for expensive multi-node clusters.
Abstract: Traditional algorithms for training multi-billion parameter models require clusters of GPUs connected via proprietary high-bandwidth networking equipment. Modern low-bandwidth training algorithms such as DiLoCo and SPARTA promise to remove this bandwidth constraint. However, testing them still demands multi-node hardware and complex orchestration. We introduce EXO Gym, an open-source library that emulates up to M virtual workers on N physical accelerators, letting researchers prototype and benchmark distributed-training strategies from a single workstation. Communication behaviour is encapsulated in modular Strategy classes, so new optimizers, sparsity schedules, or compression schemes can be expressed in a few lines of code and evaluated with full telemetry (loss, wall-clock time, GPU utilization, bytes transferred). In experiments, EXO Gym reproduces published DiLoCo scaling results on language models, extends the algorithm to convolutional networks, and enables a rapid sweep over SPARTA sparsity rates that would cost weeks on cloud resources. By collapsing the infrastructure barrier, EXO Gym puts exploratory distributed training within reach of small teams and paves the way for broader, faster progress in open-source AI.
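
To illustrate the kind of pluggable communication logic the abstract describes, here is a minimal sketch of a Strategy-style class implementing a SPARTA-like sparse parameter exchange across virtual workers. The class names, the communicate hook, and the worker-parameter layout are assumptions for illustration only, not the actual EXO Gym API.

```python
import torch

# Hypothetical base class for a pluggable communication strategy;
# the real EXO Gym interface may differ.
class Strategy:
    def communicate(self, worker_params: list[list[torch.Tensor]]) -> None:
        """Mutate each virtual worker's parameters in place after a local step."""
        raise NotImplementedError


class SpartaLikeStrategy(Strategy):
    """SPARTA-style sync: average a random fraction p of coordinates
    across workers each step, leaving the remaining entries untouched."""

    def __init__(self, p: float = 0.005):
        self.p = p

    def communicate(self, worker_params):
        # Iterate over corresponding parameter tensors across all workers.
        for tensors in zip(*worker_params):
            mask = torch.rand_like(tensors[0]) < self.p   # random sparse mask
            mean = torch.stack(tensors).mean(dim=0)       # cross-worker average
            for t in tensors:
                t[mask] = mean[mask]                      # sync only masked entries


# Usage: two virtual workers, each holding one 4x4 parameter tensor.
workers = [[torch.randn(4, 4)] for _ in range(2)]
SpartaLikeStrategy(p=0.25).communicate(workers)
```

A sketch like this suggests how a new sparsity schedule or compression scheme could be swapped in by overriding a single method, with the simulator handling worker scheduling and telemetry.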
Submission Number: 27