Accelerating Distributed Model Training through Intelligent Node Selection and Data Allocation Strategies in 6G Networks

Published: 01 Jan 2024, Last Modified: 06 Feb 2025 · ICC Workshops 2024 · CC BY-SA 4.0
Abstract: Supporting the training of artificial intelligence (AI) models is one of the visions for future sixth-generation (6G) networks. Training such models requires vast amounts of data and computational capability. However, as AI models continue to grow, it is evident that existing edge computing network architectures cannot meet the massive computing power and communication demands of distributed training for models with an ever-increasing number of parameters. In this paper, we propose a distributed training framework based on an edge-network-cloud architecture. Taking the network architecture and the computing capabilities of network nodes into account, the framework adaptively partitions node functions and allocates data across network nodes during distributed training. Specifically, aggregation nodes are responsible for parameter aggregation and updating, while training nodes execute training tasks and transmit model gradients to the aggregation nodes asynchronously. To improve training efficiency and reduce communication time, we introduce a solution based on Deep Reinforcement Learning (DRL). The algorithm intelligently allocates suitable data to nodes and selects node roles based on task-related information, thereby accelerating distributed training across network nodes. Experimental results demonstrate that the proposed algorithm effectively accelerates large-scale model training tasks.
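The paper's framework is not public, so the following is only a minimal sketch of the asynchronous aggregation pattern the abstract describes: training nodes compute gradients on local data shards and push them to an aggregation node, which applies each update as it arrives rather than waiting for all workers. All names (`AggregationNode`, `training_node`), the learning rate, the least-squares objective, and the data splits are illustrative assumptions, not the authors' implementation, and the DRL-based node selection and data allocation is not modeled here.

```python
# Sketch of asynchronous parameter aggregation (assumed setup, not the paper's code).
import queue
import threading

import numpy as np

LEARNING_RATE = 0.1  # assumed hyperparameter


class AggregationNode:
    """Holds the global model and applies gradients as they arrive."""

    def __init__(self, dim: int):
        self.params = np.zeros(dim)
        self.grad_queue: "queue.Queue[np.ndarray]" = queue.Queue()
        self._lock = threading.Lock()

    def push_gradient(self, grad: np.ndarray) -> None:
        # Called by training nodes; non-blocking for the sender.
        self.grad_queue.put(grad)

    def aggregate(self, num_updates: int) -> None:
        # Asynchronous SGD: apply each gradient as soon as it is received,
        # without synchronizing across training nodes.
        for _ in range(num_updates):
            grad = self.grad_queue.get()
            with self._lock:
                self.params -= LEARNING_RATE * grad

    def snapshot(self) -> np.ndarray:
        with self._lock:
            return self.params.copy()


def training_node(agg: AggregationNode, x: np.ndarray, y: np.ndarray, steps: int) -> None:
    """Computes least-squares gradients on a local data shard and pushes them."""
    for _ in range(steps):
        w = agg.snapshot()  # pull the latest (possibly stale) parameters
        grad = 2 * x.T @ (x @ w - y) / len(y)
        agg.push_gradient(grad)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dim, shard_size, steps = 4, 64, 50
    true_w = rng.normal(size=dim)

    agg = AggregationNode(dim)
    workers = []
    for _ in range(3):  # three training nodes, each with its own data shard
        x = rng.normal(size=(shard_size, dim))
        y = x @ true_w
        workers.append(threading.Thread(target=training_node, args=(agg, x, y, steps)))

    for t in workers:
        t.start()
    agg.aggregate(num_updates=3 * steps)
    for t in workers:
        t.join()
    print("recovered weights:", np.round(agg.snapshot(), 3))
```

Because training nodes pull parameters without waiting for outstanding updates, gradients may be computed against slightly stale weights; this staleness is the usual trade-off that asynchronous schemes accept in exchange for removing synchronization barriers, which is what lets slower nodes avoid stalling the whole system.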