Keywords: data sub-sampling, data curation, self-supervised learning, optimal transport
Abstract: Real-world datasets often display inherent imbalances in the distribution of classes or concepts. Recent studies indicate that such imbalances can lead to suboptimal performance of Self-Supervised Learning (SSL) models when evaluated across the full spectrum of concepts. To address this issue, we propose a data curation method that automatically selects a balanced subset of the data. We approach this problem as a graph matching task, where the goal is to identify the data subset whose elements are most distinct from one another in terms of pairwise similarities. We achieve this by mapping an isolated graph onto the similarity graph of the input data, leveraging the semi-unbalanced Gromov-Wasserstein optimal transport problem. We demonstrate that this problem can be solved with linear complexity and is well-suited for GPU acceleration. The effectiveness of our method is validated through experiments on small datasets, setting the stage for future exploration on larger-scale problems.
Submission Number: 18
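
To make the selection goal in the abstract concrete, the following is a minimal sketch of picking a subset whose members have low pairwise similarity. It is not the semi-unbalanced Gromov-Wasserstein solver the abstract describes, and it does not have linear complexity: it is a plain greedy heuristic on a dense cosine-similarity matrix, swapped in for illustration only. The function name `greedy_distinct_subset`, the toy data, and the choice of cosine similarity are all assumptions that do not come from the paper.

```python
import numpy as np


def greedy_distinct_subset(features, k):
    """Greedily pick k row indices with low pairwise cosine similarity.

    Illustrative heuristic only (hypothetical helper, not the paper's method).
    Assumes every row of `features` is non-zero.
    """
    # Normalize rows so that dot products are cosine similarities.
    x = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = x @ x.T  # dense n x n similarity graph

    # Start from the point with the lowest total similarity to all points.
    selected = [int(np.argmin(sim.sum(axis=1)))]
    # max_sim[i] = highest similarity between point i and any selected point.
    max_sim = sim[selected[0]].copy()

    while len(selected) < k:
        max_sim[selected] = np.inf  # never re-pick a selected point
        nxt = int(np.argmin(max_sim))
        selected.append(nxt)
        max_sim = np.maximum(max_sim, sim[nxt])
    return selected


# Toy imbalanced data (assumed for illustration): one large concept
# and two rare ones, each clustered around a different axis.
rng = np.random.default_rng(0)
counts = [50, 3, 3]
labels = np.repeat(np.arange(3), counts)
features = np.eye(3)[labels] + 0.01 * rng.standard_normal((sum(counts), 3))

picked = greedy_distinct_subset(features, k=3)
picked_labels = labels[picked]
```

On this toy input the selected subset is balanced across the three concepts even though one concept holds most of the points, which is the behavior the curation method aims for at scale.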