Secure Distributed Training at Scale

Published: 28 Jan 2022, Last Modified: 04 May 2025 · ICLR 2022 Submitted · Readers: Everyone
Keywords: distributed training, byzantine tolerance, volunteer computing
Abstract: Some of the hardest problems in deep learning can be solved by pooling together the computational resources of many independent parties, as is the case for scientific collaborations and volunteer computing. Unfortunately, any single participant in such systems can jeopardize the entire training run by sending incorrect updates, whether deliberately or by mistake. Training in the presence of such peers requires specialized distributed training algorithms with Byzantine tolerance. These algorithms often sacrifice efficiency by introducing redundant communication or passing all updates through a trusted server. As a result, it can be infeasible to apply such algorithms to large-scale distributed deep learning, where models can have billions of parameters. In this work, we propose a novel protocol for secure (Byzantine-tolerant) decentralized training that emphasizes communication efficiency. We rigorously analyze this protocol: in particular, we provide theoretical bounds for its resistance against Byzantine and Sybil attacks and show that it has only a marginal communication overhead. To demonstrate its practical effectiveness, we conduct large-scale experiments on image classification and language modeling in the presence of Byzantine attackers.
One-sentence Summary: We propose and rigorously analyze a protocol for secure (Byzantine-tolerant) decentralized training that emphasizes communication efficiency.
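For intuition about the setting the abstract describes: the core difficulty is averaging gradient updates when some peers may submit arbitrary values. Below is a minimal sketch of one standard Byzantine-robust aggregation rule, iterative clipped averaging in the spirit of CenteredClip; the function name, clipping radius, and toy data are illustrative assumptions and this is not the paper's exact protocol.

```python
import numpy as np

def centered_clip(updates: np.ndarray, tau: float = 1.0, n_iters: int = 5) -> np.ndarray:
    """Aggregate per-peer updates with iterative clipped averaging.

    updates: array of shape (n_peers, dim), one (possibly malicious) update per peer.
    Each peer's contribution is clipped to a ball of radius `tau` around the
    current estimate, bounding the influence of any single Byzantine peer.
    """
    v = np.median(updates, axis=0)  # robust starting point
    for _ in range(n_iters):
        diffs = updates - v                                   # (n_peers, dim)
        norms = np.linalg.norm(diffs, axis=1, keepdims=True)  # per-peer distance to v
        scale = np.minimum(1.0, tau / np.maximum(norms, 1e-12))
        v = v + (scale * diffs).mean(axis=0)                  # clipped averaging step
    return v

# Toy example: 8 honest peers near the true gradient, 2 Byzantine peers sending garbage.
rng = np.random.default_rng(0)
honest = rng.normal(loc=1.0, scale=0.1, size=(8, 4))
byzantine = np.full((2, 4), 100.0)
all_updates = np.concatenate([honest, byzantine])

print("naive mean:  ", all_updates.mean(axis=0))    # dragged far away from 1.0
print("clipped mean:", centered_clip(all_updates))  # stays close to 1.0
```

A naive mean lets the two attackers shift the aggregate arbitrarily, whereas the clipped estimate stays near the honest peers' updates; the communication-efficiency and Sybil-resistance aspects claimed in the abstract are separate contributions of the paper and are not reflected in this sketch.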
Community Implementations: 1 code implementation (https://www.catalyzex.com/paper/secure-distributed-training-at-scale/code)
