TL;DR: Improved rates for asynchronous stochastic optimization with better delay adaptivity via asynchronous mini-batching
Abstract: We consider the problem of asynchronous stochastic optimization, where an optimization algorithm makes updates based on stale stochastic gradients of the objective that are subject to an arbitrary (possibly adversarial) sequence of delays. We present a procedure which, for any given $q \in (0,1]$, transforms any standard stochastic first-order method into an asynchronous method with a convergence guarantee depending on the $q$-quantile delay of the sequence. This approach leads to convergence rates of the form $O(\tau_q/(qT)+\sigma/\sqrt{qT})$ for non-convex and $O(\tau_q^2/(qT)^2+\sigma/\sqrt{qT})$ for convex smooth problems, where $\tau_q$ is the $q$-quantile delay, generalizing and improving on existing results that depend on the average delay. We further show a method that automatically adapts to all quantiles simultaneously, without any prior knowledge of the delays, achieving convergence rates of the form $O(\inf_{q} \tau_q/(qT)+\sigma/\sqrt{qT})$ for non-convex and $O(\inf_{q} \tau_q^2/(qT)^2+\sigma/\sqrt{qT})$ for convex smooth problems. Our technique is based on asynchronous mini-batching with a careful batch-size selection and filtering of stale gradients.
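To make the high-level idea concrete, here is a minimal, hypothetical sketch of asynchronous mini-batching with filtering of stale gradients; it is not the paper's exact algorithm or batch-size rule. The wrapper class, the `delay_threshold` parameter (standing in for an estimate of the $q$-quantile delay $\tau_q$), and the `batch_size` choice are all illustrative assumptions.

```python
# Illustrative sketch (not the paper's exact procedure): wrap a base stochastic
# first-order method with asynchronous mini-batching. Gradients that arrive with
# a delay above a chosen threshold (e.g., an estimate of the q-quantile delay)
# are filtered out; the rest are averaged into a mini-batch before one update
# of the base method.

from dataclasses import dataclass, field
from typing import Callable, List
import numpy as np


@dataclass
class AsyncMiniBatchWrapper:
    # base_step maps (current iterate, averaged gradient) -> next iterate
    base_step: Callable[[np.ndarray, np.ndarray], np.ndarray]
    x: np.ndarray               # current iterate
    batch_size: int             # hypothetical batch-size choice
    delay_threshold: float      # hypothetical stand-in for an estimate of tau_q
    _buffer: List[np.ndarray] = field(default_factory=list)

    def receive(self, grad: np.ndarray, delay: int) -> None:
        """Handle one stochastic gradient reported together with its delay."""
        if delay > self.delay_threshold:
            return  # filter: discard gradients that are too stale
        self._buffer.append(grad)
        if len(self._buffer) >= self.batch_size:
            avg_grad = np.mean(self._buffer, axis=0)    # asynchronous mini-batch
            self.x = self.base_step(self.x, avg_grad)   # one base-method update
            self._buffer.clear()


# Usage example with plain SGD as the base first-order method.
if __name__ == "__main__":
    eta = 0.1
    sgd_step = lambda x, g: x - eta * g
    opt = AsyncMiniBatchWrapper(base_step=sgd_step, x=np.zeros(3),
                                batch_size=4, delay_threshold=10)
    rng = np.random.default_rng(0)
    for t in range(100):
        # Noisy gradient of ||x||^2 / 2 at the current iterate (a simplification:
        # in the asynchronous setting gradients are evaluated at stale iterates).
        grad = opt.x + rng.normal(scale=0.5, size=3)
        delay = int(rng.integers(0, 20))  # arbitrary delay sequence
        opt.receive(grad, delay)
    print(opt.x)
```

Intuitively, filtering keeps roughly a $q$-fraction of the gradients, each with delay at most $\tau_q$, and averaging them into mini-batches reduces the number of (stale) updates, which is what drives the $\tau_q/(qT)$-type dependence in the rates above.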
Lay Summary: Training modern machine learning models often involves huge datasets and running computations in parallel across many computing units. But when these systems update their models, they sometimes use outdated (or “stale”) information because different units report back at different times. This delay can slow down learning and degrade performance.
We developed new methods that allow training algorithms to reduce the impact of delays by making fewer, more meaningful updates based on the most relevant parts of the delayed computations. Instead of relying on the average delay, which can be skewed by a few very slow responses, our methods adapt to how often delays of a given size occur. This shift can lead to faster and more stable training.
Our methods can be applied to many standard training algorithms with little to no modification, and they scale naturally with increasing parallelism, making them compelling options for large-scale systems. By adjusting how and when updates are made, they make better use of the available computations, even when some are delayed. As a result, our methods can help machine learning models learn faster and more reliably in real-world computing environments.
Primary Area: Theory->Optimization
Keywords: Asynchronous, delay, stochastic, optimization, arbitrary, mini-batching, batching
Submission Number: 6514