TL;DR: We introduce sliced optimal transport dataset distance, a model-agnostic, embedding-agnostic approach for dataset comparison that requires no training, is robust to variations in the number of classes, and can handle disjoint label sets.
Abstract: We introduce sliced optimal transport dataset distance (s-OTDD), a model-agnostic, embedding-agnostic approach for dataset comparison that requires no training, is robust to variations in the number of classes, and can handle disjoint label sets. The core innovation is Moment Transform Projection (MTP), which maps a label, represented as a distribution over features, to a real number. Using MTP, we derive a data point projection that transforms datasets into one-dimensional distributions. The s-OTDD is defined as the expected Wasserstein distance between the projected distributions, with respect to random projection parameters. Leveraging the closed-form solution of one-dimensional optimal transport, s-OTDD achieves (near-)linear computational complexity in the number of data points and feature dimensions and is independent of the number of classes. With its geometrically meaningful projection, s-OTDD strongly correlates with the optimal transport dataset distance while being more efficient than existing dataset discrepancy measures. Moreover, it correlates well with the performance gap in transfer learning and classification accuracy in data augmentation.
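The pipeline described in the abstract (project each labeled data point to a real number, compare the resulting one-dimensional distributions with the closed-form Wasserstein distance, and average over random projection parameters) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (see the linked repository for that). The function names, the use of a plain k-th raw moment as a stand-in for MTP, the random two-weight combination of feature and label projections, and the quantile-grid approximation are all assumptions made for illustration.

```python
import numpy as np


def wasserstein_1d(u, v, p=2, n_quantiles=200):
    """Approximate p-Wasserstein distance between two 1D samples.

    In one dimension, optimal transport matches quantiles, so no solver is
    needed. A fixed quantile grid (an assumption of this sketch) lets the two
    samples have different sizes.
    """
    qs = (np.arange(n_quantiles) + 0.5) / n_quantiles
    uq = np.quantile(u, qs)
    vq = np.quantile(v, qs)
    return np.mean(np.abs(uq - vq) ** p) ** (1.0 / p)


def project_dataset(X, y, theta, k, w):
    """Map every (feature, label) pair to a single real number.

    Feature part: inner product with the direction theta.
    Label part: k-th raw moment of the projected features of that label's
    class, a simplified, hypothetical stand-in for MTP.
    The two parts are mixed with the weights w.
    """
    y = np.asarray(y).ravel()
    z = X @ theta
    # Per-class mean of z**k computed with bincount, so no loop over classes.
    _, inv = np.unique(y, return_inverse=True)
    sums = np.bincount(inv, weights=z ** k)
    counts = np.bincount(inv)
    label_vals = (sums / counts)[inv]
    return w[0] * z + w[1] * label_vals


def sliced_dataset_distance(X1, y1, X2, y2, n_proj=100, max_moment=3, p=2, seed=0):
    """Monte Carlo estimate of the expected 1D distance over random parameters."""
    rng = np.random.default_rng(seed)
    d = X1.shape[1]
    total = 0.0
    for _ in range(n_proj):
        theta = rng.normal(size=d)
        theta /= np.linalg.norm(theta)          # random unit direction
        w = rng.normal(size=2)
        w /= np.linalg.norm(w)                  # random feature/label mixing weights
        k = int(rng.integers(1, max_moment + 1))  # random moment order (assumed scheme)
        a = project_dataset(X1, y1, theta, k, w)
        b = project_dataset(X2, y2, theta, k, w)
        total += wasserstein_1d(a, b, p) ** p
    return (total / n_proj) ** (1.0 / p)


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X_a = rng.normal(size=(300, 8))
    y_a = rng.integers(0, 3, size=300)
    X_b = rng.normal(loc=1.0, size=(500, 8))
    y_b = rng.integers(10, 15, size=500)        # disjoint label set, different class count
    print(sliced_dataset_distance(X_a, y_a, X_b, y_b))
```

Because labels enter only through statistics of their class-conditional features, the two datasets in the usage example can have different sizes, different numbers of classes, and disjoint label sets, which is the property the abstract highlights.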
Lay Summary: Comparing datasets is challenging when they differ in size, shape, or number of classes. We developed a method called the sliced optimal transport dataset distance (s-OTDD) that addresses these issues. s-OTDD requires no training and works even when the datasets have entirely different labels. It relies on a technique called Moment Transform Projection (MTP) to turn complex data into simple numbers. Projecting datasets onto a single dimension makes their similarity easy to measure. The process is efficient, accurate, and scales to large datasets. In our experiments, s-OTDD agrees closely with existing methods that are much slower, and it can also predict how well a model trained on one dataset will perform on another. This makes s-OTDD a convenient tool for machine learning researchers who need to compare datasets.
Link To Code: https://github.com/hainn2803/s-OTDD
Primary Area: General Machine Learning->Everything Else
Keywords: Dataset distance, sliced optimal transport, data-centric machine learning, transfer learning
Submission Number: 6954