Abstract: In Artificial Intelligence (AI), training expansive models with billions of parameters requires substantial computational resources, which has driven the adoption of parallel computing frameworks. However, these frameworks often confront node performance imbalances arising from disparities in computational capabilities and network conditions. To address this issue, we introduce the BalanceNet Orchestrator (BNO), a dynamic task allocation algorithm designed to equilibrate workloads in parallel training environments. The BalanceNet Orchestrator assesses and adapts to node-specific performance in real time, facilitating optimal workload distribution and resource utilization. This method significantly enhances training efficiency and accelerates model convergence, presenting an efficient approach for training large-scale AI models within parallel training architectures.
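The paper does not include pseudocode here, but the core idea of the abstract, redistributing work in proportion to each node's measured performance, can be sketched as follows. This is a minimal illustrative example, not the authors' algorithm; the function name `allocate_batches` and the use of measured throughput (samples/second) as the performance signal are assumptions.

```python
def allocate_batches(total_batches: int, throughputs: dict[str, float]) -> dict[str, int]:
    """Split a workload across nodes in proportion to measured throughput.

    throughputs: node id -> recently measured samples/second (illustrative
    performance signal; a real orchestrator would update this in real time).
    """
    total_rate = sum(throughputs.values())
    # Give each node the floor of its proportional share.
    alloc = {node: int(total_batches * rate / total_rate)
             for node, rate in throughputs.items()}
    # Hand leftover batches to the fastest nodes first.
    remainder = total_batches - sum(alloc.values())
    for node in sorted(throughputs, key=throughputs.get, reverse=True)[:remainder]:
        alloc[node] += 1
    return alloc

# One node twice as fast as the others receives twice the work.
print(allocate_batches(100, {"gpu0": 2.0, "gpu1": 1.0, "gpu2": 1.0}))
# → {'gpu0': 50, 'gpu1': 25, 'gpu2': 25}
```

In a running system this allocation would be recomputed each training step (or every few steps) from fresh throughput measurements, so a node that slows down due to network congestion automatically receives less work.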
External IDs: dblp:conf/icoin/SunNH24