Abstract: To achieve better performance, researchers have increasingly built larger deep learning models, substantially raising training costs and prompting the development of distributed training on GPU clusters. However, conventional distributed training approaches have limitations: data parallelism suffers from excessive memory demands and communication overhead during gradient synchronization, while model parallelism fails to achieve optimal device utilization due to strict computational dependencies. To overcome these challenges, researchers have proposed hybrid parallelism: the model is segmented into multiple stages, each of which may internally use data parallelism, and split batches of training data are processed across the stages in a pipeline-like manner, accelerating training. Meanwhile, freezing mechanisms widely used in model fine-tuning, which cancel gradient computation and weight updates for converged parameters to reduce computational overhead, have yet to be efficiently integrated into hybrid parallel training; existing approaches fail to balance training speedup against accuracy and thus cannot further shorten the time the model needs to reach a converged state. In this paper, we propose Reinforcement Learning Freeze (RLFreeze), a freezing strategy for distributed DNN training in heterogeneous GPU clusters, particularly under hybrid parallelism. We first introduce a mixed freezing criterion based on gradients and gradient variation that accurately freezes converged parameters while minimizing the freezing of unconverged ones. RLFreeze then selects the parameters to freeze according to this criterion and, during training, dynamically adjusts the thresholds behind these freezing decisions using reinforcement learning, achieving a balance between accuracy and accelerated model training.
Experimental results demonstrate that RLFreeze can further improve training efficiency in both data parallelism and hybrid parallelism while maintaining model accuracy.
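The mixed freezing criterion described above can be illustrated with a minimal sketch: a parameter group is frozen only when both its gradient magnitude and its change in gradient are small. The function name, the per-tensor granularity, and the threshold values here are illustrative assumptions, not the paper's exact formulation; in RLFreeze the thresholds would be tuned by the reinforcement learning agent rather than fixed.

```python
import numpy as np

def should_freeze(grad, prev_grad, grad_thresh, delta_thresh):
    """Hypothetical mixed freezing criterion (sketch, not the paper's exact rule).

    Freeze a parameter group only when BOTH conditions hold:
      1. its gradient magnitude is small (the parameters have converged), and
      2. its gradient variation between steps is small (the gradient is stable,
         so the small magnitude is not a transient dip).
    """
    grad_norm = np.linalg.norm(grad)            # criterion 1: gradient magnitude
    delta_norm = np.linalg.norm(grad - prev_grad)  # criterion 2: gradient variation
    return grad_norm < grad_thresh and delta_norm < delta_thresh

# Illustrative use: a tiny, stable gradient qualifies for freezing,
# while a large gradient does not.
stable = should_freeze(np.array([1e-4, 1e-4]), np.array([1.2e-4, 0.9e-4]),
                       grad_thresh=1e-3, delta_thresh=1e-3)
active = should_freeze(np.array([0.5, -0.8]), np.array([0.4, -0.7]),
                       grad_thresh=1e-3, delta_thresh=1e-3)
```

Combining the two signals is what distinguishes this from a plain gradient-magnitude rule: a momentarily small but still-fluctuating gradient fails the variation check, so unconverged parameters are less likely to be frozen prematurely.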
External IDs: dblp:journals/dint/GongCDFNZ25