Mitigating Data Stalls in Deep Learning with Multi-times Data Loading Rule

Derong Chen, Shuang Liang, Gang Hu, Han Xu, Xianqiang Luo, Hao Li, Jie Shao

Published: 2023, Last Modified: 19 May 2025. Venue: DASFAA (1) 2023. License: CC BY-SA 4.0

Abstract: As AI datasets grow, most deep learning jobs separate data storage from computation, so I/O optimization has become an important part of optimizing deep learning training. Recent studies focus on I/O optimization for deep learning methods that follow a one-time data loading rule, where each data item is used only once per epoch. However, these methods cannot handle jobs that follow a multi-times data loading rule (e.g., meta-learning). By analyzing the characteristics of multi-times data loading, we design a simple, intuitive, and effective cache replacement strategy called the steady cache strategy. This strategy uses a cache to mitigate data stalls and converts the data placement problem into a 0/1 knapsack problem. To the best of our knowledge, we are the first to mitigate data stalls in AI jobs with a multi-times data loading rule, and our method is also suitable for multi-job scenarios. Our experiments demonstrate that the steady cache strategy achieves a substantial improvement over the LRU strategy.
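
To illustrate what casting data placement as a 0/1 knapsack problem can look like, here is a minimal Python sketch. It is not the paper's formulation, which the abstract does not spell out. It assumes that each data item has an integer size (the knapsack weight) and an integer benefit such as the expected number of loads per epoch (the knapsack value), and that the cache has a fixed integer capacity. The function name `choose_cached_items` and all parameters are hypothetical.

```python
from typing import List, Tuple


def choose_cached_items(
    sizes: List[int], values: List[int], capacity: int
) -> Tuple[int, List[int]]:
    """Pick the items to keep in the cache with the standard 0/1 knapsack DP.

    Assumption (not from the paper): sizes[i] is the cache space item i
    occupies and values[i] is the benefit of caching it, for example its
    expected number of loads per epoch.
    Returns (best total value, indices of the chosen items).
    """
    n = len(sizes)
    # best[i][c] = maximum value using only the first i items within capacity c
    best = [[0] * (capacity + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        size, value = sizes[i - 1], values[i - 1]
        for c in range(capacity + 1):
            best[i][c] = best[i - 1][c]  # option 1: do not cache item i-1
            if size <= c:
                # option 2: cache item i-1 if that yields a higher value
                candidate = best[i - 1][c - size] + value
                if candidate > best[i][c]:
                    best[i][c] = candidate

    # Walk the table backwards to recover which items were chosen.
    chosen: List[int] = []
    c = capacity
    for i in range(n, 0, -1):
        if best[i][c] != best[i - 1][c]:
            chosen.append(i - 1)
            c -= sizes[i - 1]
    chosen.reverse()
    return best[n][capacity], chosen


if __name__ == "__main__":
    # Three hypothetical items and a cache that holds 6 size units.
    total, cached = choose_cached_items([3, 4, 2], [4, 5, 3], 6)
    print(total, cached)  # 8 [1, 2]
```

In this toy instance the cache cannot hold items 0 and 1 together, so the selection keeps items 1 and 2, which exactly fill the capacity. The dynamic program runs in O(n * capacity) time, so it is practical only when sizes are expressed in coarse integer units.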