Abstract: In-network aggregation (INA) offloads gradient aggregation onto switches, thereby reducing both aggregation latency and traffic volume. However, INA resources are limited because on-chip switch memory is expensive, which poses distinct challenges for scheduling these resources effectively in multi-job Machine Learning as a Service (MLaaS) scenarios. In this paper, we explore the scheduling of INA resources in the spatial and temporal dimensions, focusing on its impact on the average job completion time (JCT) and the efficiency of INA resources. We propose Mina, a co-design of algorithm and system that assigns INA resources to each job and schedules these resources among multiple jobs. Our experiments show that Mina attains an INA efficiency score of 0.9099 on average, $2.67\times$ higher than the baseline, implying that almost all jobs run nearly as efficiently as they would with exclusive INA acceleration. Furthermore, Mina adapts well to varied cluster configurations and incurs only minimal additional overhead.
External IDs: doi:10.1109/ton.2025.3617081