[Short] Beyond Data Size: Exploring the Impact of Dataset Diversity and Density in Self-Distillation Learning

Published: 02 Mar 2026 · Last Modified: 02 Apr 2026 · ICLR 2026 Workshop DATA-FM · CC BY 4.0
Keywords: Self-Supervised Learning, Scaling Laws, Data Curation, Remote Sensing
TL;DR: We propose a predictive scaling law for self-distillation learning that jointly models the impact of unique data samples, data repetition, and data diversity.
Abstract: Current scaling laws suggest that maximizing the number of unique samples is key to superior pre-training. For self-distillation models such as iBOT, we show that data density (repetition) and data diversity (measured by the Vendi score) can be as critical as data size (the total number of unique samples). Extensive experiments on a large remote sensing dataset demonstrate that, under equivalent compute, seeing a smaller, high-quality subset multiple times outperforms a single pass over a massive stream of unique samples. Based on these results, we propose a predictive scaling law that models downstream performance as a joint function of unique data size, data density, and data diversity, and we demonstrate the extrapolation power of the proposed formula.
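The abstract does not state the formula itself, so the following is a minimal, hedged Python sketch of the two ingredients it names: the Vendi score (the exponential of the Shannon entropy of the eigenvalues of a normalized similarity kernel over the samples) and a joint power-law fit of downstream error against unique data size N, repetition R, and diversity D. The functional form `joint_law`, the synthetic run data, and all coefficients are illustrative assumptions, not the paper's actual law.

```python
import numpy as np
from scipy.optimize import curve_fit

def vendi_score(embeddings: np.ndarray) -> float:
    """Vendi score: exp of the Shannon entropy of the eigenvalues
    of the normalized similarity kernel K / n (trace-one, PSD)."""
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    k = (x @ x.T) / len(x)          # cosine-similarity Gram matrix / n
    lam = np.linalg.eigvalsh(k)
    lam = lam[lam > 1e-12]          # drop numerical zeros before log
    return float(np.exp(-np.sum(lam * np.log(lam))))

def joint_law(X, a, alpha, beta, gamma, c):
    """ASSUMED form: downstream error decays as a power law in unique
    samples N, repetitions R, and diversity D, plus an error floor c."""
    n, r, d = X
    return a * n**-alpha * r**-beta * d**-gamma + c

rng = np.random.default_rng(0)

# Demo of the diversity measure on a random embedding set.
emb = rng.normal(size=(256, 64))
print("Vendi score of a random embedding set:", round(vendi_score(emb), 2))

# Toy fit on synthetic runs (placeholder for real measured runs).
n = rng.uniform(1e4, 1e6, 40)       # unique samples per run
r = rng.uniform(1, 32, 40)          # repetitions (epochs over the subset)
d = rng.uniform(5, 50, 40)          # Vendi scores of the curated subsets
err = joint_law((n, r, d), 2.0, 0.15, 0.05, 0.10, 0.2) \
      + rng.normal(0, 0.005, 40)    # simulated downstream error
params, _ = curve_fit(joint_law, (n, r, d), err,
                      p0=[1, 0.1, 0.1, 0.1, 0.1])
print("fitted (a, alpha, beta, gamma, c):", np.round(params, 3))

# Extrapolation to an unseen, larger budget, as the abstract claims.
print("predicted error at (N=2e6, R=8, D=40):",
      round(joint_law((2e6, 8.0, 40.0), *params), 4))
```

Fitting in log-error space or with weighted residuals is common for scaling-law regressions; the plain least-squares fit above is only the simplest possible setup.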
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 165