Corgi$^2$: A Hybrid Offline-Online Approach To Storage-Aware Data Shuffling For SGD

22 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: general machine learning (i.e., none of the above)
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Data Shuffle, SGD, Theoretical Guarantees
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: An efficient alternative to fully shuffling data before training, with theoretical analysis and empirical evidence about the tradeoffs
Abstract: When using Stochastic Gradient Descent (SGD) to train machine learning models, it is often crucial to provide the model with examples sampled at random from the dataset. However, for large datasets stored in the cloud, random access to individual examples can be costly and inefficient. A recent work proposed an online shuffling algorithm called CorgiPile, which greatly improves the efficiency of data access, at the cost of some performance loss that becomes particularly apparent for large datasets stored in homogeneous shards (e.g., video datasets). In this paper, we extend CorgiPile by adding an efficient offline iteration, transforming it into a hybrid two-step partial data shuffling strategy. We show through comprehensive theoretical analysis that our approach performs similarly to SGD with random access (even for homogeneous data) without compromising the data access efficiency of CorgiPile, and demonstrate its practical advantages through experimental results.
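The abstract describes a two-step strategy: an offline pass that redistributes examples across shards (so that shards are no longer homogeneous), followed by CorgiPile-style online block shuffling during training. The following is a minimal in-memory sketch of that idea, not the paper's actual storage-aware implementation; the function names (`offline_shuffle`, `online_block_shuffle`) and parameters are illustrative assumptions rather than names from the paper.

```python
import random

def offline_shuffle(shards, num_output_shards, rng=None):
    """Offline step (sketch): scatter examples from the input shards into
    randomly chosen output shards, breaking up homogeneous shards.
    The real offline pass is storage-aware; this is a simplified stand-in."""
    rng = rng or random.Random(0)
    outputs = [[] for _ in range(num_output_shards)]
    for shard in shards:
        for example in shard:
            outputs[rng.randrange(num_output_shards)].append(example)
    return outputs

def online_block_shuffle(shards, buffer_blocks, block_size, rng=None):
    """Online step (CorgiPile-style sketch): read contiguous blocks from the
    shards, fill a buffer with randomly chosen blocks, shuffle the buffer,
    and yield examples to the SGD loop."""
    rng = rng or random.Random(0)
    blocks = []
    for shard in shards:
        for i in range(0, len(shard), block_size):
            blocks.append(shard[i:i + block_size])
    rng.shuffle(blocks)
    for start in range(0, len(blocks), buffer_blocks):
        buffer = [ex for block in blocks[start:start + buffer_blocks] for ex in block]
        rng.shuffle(buffer)
        yield from buffer

# Illustrative usage: homogeneous shards (all examples in a shard share a label)
# become mixed after the offline pass, and the online pass streams a
# near-random order of examples to the training loop.
if __name__ == "__main__":
    shards = [[(x, label) for x in range(100)] for label in range(4)]
    mixed_shards = offline_shuffle(shards, num_output_shards=4)
    for example in online_block_shuffle(mixed_shards, buffer_blocks=2, block_size=10):
        pass  # feed `example` to the SGD update here
```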
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 5159