Optimistic Asynchrony Control: Achieving Synchronous Convergence With Asynchronous Throughput for Embedding Model Training

Published: 18 Jun 2024, Last Modified: 10 Jul 2024 · WANT@ICML 2024 Poster · CC BY 4.0
Keywords: embedding model training, asynchronous training, convergence, graph neural networks, GNNs
Abstract: Modern embedding-based machine learning (ML) models can contain hundreds of gigabytes of parameters, often exceeding the capacity of the GPU hardware accelerators critical for training. One solution is to use a mixed CPU-GPU setup, where embedding parameters are stored in CPU memory and subsets are repeatedly transferred to the GPU for computation. In this setup, two training paradigms exist: synchronous training and asynchronous training. In the former, batches are transferred one by one, leading to low throughput but fast model convergence. In contrast, during asynchronous training batches are transferred in parallel, allowing more batches to be processed per unit time. Asynchronous training, however, can degrade model quality because concurrent batches that access the same model parameters produce stale updates. In this work, we present Optimistic Asynchrony Control (OAC), a method that allows asynchronous batch processing while ensuring model equivalence to a synchronous training execution. Our method is inspired by Optimistic Concurrency Control used in database systems. The main idea is to allow parallel processing and transfer of batches from the CPU to the GPU, but to validate each batch on the GPU before the model is updated, ensuring the batch sees the correct parameter values---the values it would have seen if batches had been processed and transferred one by one. We show that OAC achieves the best of both worlds, retaining the convergence of synchronous training while matching the throughput of asynchronous training. As a result, OAC achieves the best time-to-accuracy of the three methods for mixed CPU-GPU embedding model training.
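To make the validate-before-commit idea concrete, below is a minimal sketch in the spirit of the abstract's description, assuming a per-row version-counter scheme for detecting conflicting parameter accesses. The names (`EmbeddingTable`, `process_batch`, `compute_gradients`) and the recompute-on-conflict path are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch of an optimistic-asynchrony-control update loop
# (assumptions: per-row version counters, recompute on validation failure).
import numpy as np

class EmbeddingTable:
    def __init__(self, num_rows, dim):
        self.weights = np.random.randn(num_rows, dim).astype(np.float32)
        self.versions = np.zeros(num_rows, dtype=np.int64)  # per-row update counters

    def read(self, rows):
        # Snapshot the rows a batch needs, plus their versions (the batch's "read set").
        return self.weights[rows].copy(), self.versions[rows].copy()

    def validate(self, rows, snapshot_versions):
        # The batch is valid only if no row it read was updated by an earlier batch.
        return np.array_equal(self.versions[rows], snapshot_versions)

    def apply(self, rows, grads, lr=0.1):
        self.weights[rows] -= lr * grads
        self.versions[rows] += 1

def compute_gradients(emb, targets):
    # Stand-in for the real model's forward/backward pass on the GPU.
    return emb - targets

def process_batch(table, rows, targets):
    emb, snap = table.read(rows)             # transfer to GPU can overlap other batches
    grads = compute_gradients(emb, targets)  # compute on the (possibly stale) snapshot
    if table.validate(rows, snap):
        table.apply(rows, grads)             # optimistic path: commit directly
    else:
        # Conflict detected: re-read fresh values and recompute so the committed
        # update matches what a serial (synchronous) execution would produce.
        fresh, _ = table.read(rows)
        table.apply(rows, compute_gradients(fresh, targets))

if __name__ == "__main__":
    table = EmbeddingTable(num_rows=1000, dim=8)
    rng = np.random.default_rng(0)
    for _ in range(10):
        rows = rng.integers(0, 1000, size=32)
        process_batch(table, rows, targets=np.zeros((32, 8), dtype=np.float32))
```

In this sketch the common case pays no synchronization cost, and only batches whose read set was overwritten by an earlier commit fall back to the corrective path, which is what preserves equivalence to a one-by-one execution.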
Submission Number: 40