MMDataLoader: Reusing Preprocessed Data Among Concurrent Model Training Tasks

Published: 01 Jan 2024 · Last Modified: 18 May 2025 · IEEE Trans. Computers 2024 · CC BY-SA 4.0
Abstract: Data preprocessing plays an important role in deep learning, as it directly affects training efficiency. Preprocessing is typically performed on the CPU, and the preprocessed data are then fed to models trained on the GPU. We observe that CPU-side preprocessing can become a bottleneck for the entire model training task. To tackle this issue, we have developed MMDataLoader, which enables reusing preprocessed data among multiple model training tasks. MMDataLoader automatically constructs a data preprocessing pipeline from each task's specific preprocessing workflow, maximizing data reuse and reducing the computing workload on the CPU. Unlike conventional data loaders, which operate at the task level and serve a single training task, MMDataLoader operates at the server level and provides data for all concurrently running tasks. Extensive experiments show that MMDataLoader significantly increases preprocessing throughput without affecting model convergence, compared to conventional approaches in which model training tasks are executed concurrently with independent data loaders. For instance, with three tasks running, preprocessing throughput increases by 1.6x to 3.15x, depending on the tasks being executed and the proportion of preprocessing operations shared among them.
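The core idea described in the abstract — running the shared prefix of several tasks' preprocessing pipelines once and reusing the result, while each task still applies its own task-specific operations — can be sketched as below. This is a minimal illustrative sketch, not the actual MMDataLoader implementation; all class and function names here are hypothetical.

```python
# Illustrative sketch: a server-level loader that caches the output of the
# preprocessing ops shared among tasks, so the shared prefix runs only once
# per sample regardless of how many concurrent tasks consume it.

def decode(sample):
    # Stand-in for a shared op (e.g., image decode/resize on the CPU).
    return sample * 2

def augment_task_a(x):
    # Stand-in for a task-specific op of training task A.
    return x + 1

def augment_task_b(x):
    # Stand-in for a task-specific op of training task B.
    return x - 1

class SharedPreprocessLoader:
    """Hypothetical server-level loader sharing a common op prefix."""

    def __init__(self, shared_ops):
        self.shared_ops = shared_ops   # ops common to all tasks
        self.cache = {}                # sample_id -> shared-prefix output
        self.hits = 0
        self.misses = 0

    def _run_shared(self, sample_id, raw):
        # Run the shared prefix once; later requests reuse the cached result.
        if sample_id in self.cache:
            self.hits += 1
            return self.cache[sample_id]
        self.misses += 1
        out = raw
        for op in self.shared_ops:
            out = op(out)
        self.cache[sample_id] = out
        return out

    def get(self, sample_id, raw, task_ops):
        # Shared prefix (possibly cached) followed by task-specific suffix.
        x = self._run_shared(sample_id, raw)
        for op in task_ops:
            x = op(x)
        return x

loader = SharedPreprocessLoader(shared_ops=[decode])
# Two concurrent "tasks" request the same sample; decode runs only once.
a = loader.get(0, 10, [augment_task_a])  # decode then task A's augmentation
b = loader.get(0, 10, [augment_task_b])  # cached decode, task B's augmentation
```

In this toy version the second request is a cache hit, so the CPU cost of the shared op is paid once for both tasks; the speedup naturally grows with the fraction of ops the tasks share, matching the 1.6x–3.15x range reported for three tasks.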