ATA: Adaptive Task Allocation for Efficient Resource Management in Distributed Machine Learning

Published: 01 May 2025, Last Modified: 18 Jun 2025, ICML 2025 poster, CC BY 4.0
TL;DR: We propose a method that learns machine speeds on the fly to assign tasks more efficiently in parallel computing.
Abstract: Asynchronous methods are fundamental for parallelizing computations in distributed machine learning. They aim to accelerate training by fully utilizing all available resources. However, their greedy approach can lead to inefficiencies, using more computation than required, especially when computation times vary across devices. If the computation times were known in advance, training could be fast and resource-efficient by assigning more tasks to faster workers. The challenge lies in achieving this optimal allocation without prior knowledge of the computation time distributions. In this paper, we propose ATA (Adaptive Task Allocation), a method that adapts to heterogeneous and random distributions of worker computation times. Through rigorous theoretical analysis, we show that ATA identifies the optimal task allocation and performs comparably to methods with prior knowledge of computation times. Experimental results further demonstrate that ATA is resource-efficient, significantly reducing costs compared to the greedy approach, which can be arbitrarily expensive depending on the number of workers.
Lay Summary: These days, many computers and devices can be used to train machine learning models. However, they often run at different speeds, and we typically don’t know how fast each one is ahead of time. A common approach is to give all machines the same task and only use the results from the fastest ones. But if we need fewer results than the number of available devices, this leads to significant waste — slower machines still do the work, but their output gets ignored. This paper studies how to assign tasks more intelligently when machine speeds are unknown and unpredictable. The goal is to complete the required work efficiently without wasting computational resources. Ideally, faster machines would handle more tasks, and slower ones fewer — but without knowing speeds in advance, this is challenging. We introduce ATA (Adaptive Task Allocation), a method that learns how fast each machine is over time and adapts the task assignment accordingly. Both theoretical analysis and experiments show that ATA performs nearly as well as if the machine speeds were known in advance.
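Illustrative sketch: The idea of "assigning more tasks to faster workers" can be made concrete with a small example. The code below is not the ATA algorithm from the paper; it is a minimal sketch under stated assumptions. The function names (allocate_tasks, optimistic_times), the form of the confidence bonus, and the floor value are all hypothetical choices made for illustration. The first function splits a fixed number of tasks across workers given per-task time estimates, always handing the next task to the worker that would finish it earliest. The second shows one generic bandit-style way to form optimistic (lower confidence bound) estimates from observed times, assuming every worker has been observed at least once.

```python
import heapq
import math


def allocate_tasks(est_times, num_tasks):
    """Split num_tasks across workers given estimated per-task times.

    Each task goes to the worker that would finish it earliest, so
    faster workers end up with more tasks. Illustrative only.
    """
    alloc = [0] * len(est_times)
    # Heap entries: (finish time if the worker takes one more task, worker index).
    heap = [(t, i) for i, t in enumerate(est_times)]
    heapq.heapify(heap)
    for _ in range(num_tasks):
        _, i = heapq.heappop(heap)
        alloc[i] += 1
        heapq.heappush(heap, ((alloc[i] + 1) * est_times[i], i))
    return alloc


def optimistic_times(sum_times, counts, round_idx, scale=1.0, floor=1e-6):
    """Lower-confidence-bound estimates of per-task time.

    Assumes counts[i] >= 1 for every worker. The bonus term is a generic
    UCB-style choice, not the one analyzed in the paper.
    """
    estimates = []
    for total, n in zip(sum_times, counts):
        mean = total / n
        bonus = scale * math.sqrt(2.0 * math.log(round_idx + 1) / n)
        # Clamp so the estimate stays positive for the allocation step.
        estimates.append(max(mean - bonus, floor))
    return estimates


if __name__ == "__main__":
    # Known speeds: worker 0 is twice as fast as worker 1, four times worker 2.
    print(allocate_tasks([1.0, 2.0, 4.0], 7))
    # Unknown speeds: allocate using optimistic estimates from past observations.
    est = optimistic_times([10.0, 10.0], [10, 5], round_idx=3)
    print(allocate_tasks(est, 6))
```

In this sketch, a worker with few observations receives a larger bonus and therefore looks faster than its empirical mean suggests, which encourages the allocator to keep trying it until its estimate becomes reliable.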
Primary Area: Theory->Online Learning and Bandits
Keywords: Multi-Armed Bandit, UCB, adaptive task allocation, asynchronous methods, parallel methods, SGD
Submission Number: 10389