- TL;DR: We propose a simple, general, and space-efficient data format to accelerate deep learning training by allowing sample fidelity to be dynamically selected at training time
- Abstract: Deep learning training accesses vast amounts of data at high velocity, posing challenges for datasets retrieved over commodity networks and storage devices. We introduce a way to dynamically reduce the overhead of fetching and transporting training data with a method we term Progressive Compressed Records (PCRs). PCRs deviate from previous formats by leveraging progressive compression to split each training example into multiple examples of increasingly higher fidelity, without adding to the total data size. Training examples of similar fidelity are grouped together, which reduces both the system overhead and data bandwidth needed to train a model. We show that models can be trained on aggressively compressed representations of the training data and still retain high accuracy, and that PCRs can enable a 2x speedup on average over baseline formats using JPEG compression. Our results hold across deep learning architectures for a wide range of datasets: ImageNet, HAM10000, Stanford Cars, and CelebA-HQ.
- Keywords: Deep Learning, Storage, Bandwidth, Compression