A universal metric of dataset similarity for multi-source learning

22 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: transfer learning, meta learning, and lifelong learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: multi-source learning, dataset similarity, optimal transport, privacy-preservation
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: The study introduces a new metric for assessing dataset similarity that is dataset-agnostic and provides interpretable results that correlate robustly with multi-source learning performance.
Abstract: Multi-source learning is a machine learning approach that involves training on data from multiple sources. Applied domains such as healthcare and finance increasingly use multi-source learning to improve model performance. However, datasets collected from different sources can be non-identically distributed, which degrades model performance. Most existing methods for assessing dataset similarity are dataset- or task-specific: they propose similarity metrics that are either unbounded and dependent on dataset dimension and scale, or that require model training. Moreover, these metrics can only be calculated by exchanging data across sources, which raises privacy concerns in domains such as healthcare and finance. To address these challenges, we propose a novel bounded metric for assessing dataset similarity. Our metric exhibits several desirable properties: it is dataset-agnostic, considers label information, and requires no model training. First, we establish a theoretical connection between our metric and the learning process. Next, we extensively evaluate our metric on a range of real-world datasets and demonstrate that it assigns scores that align with how these data were collected. Further, we show a robust and interpretable relationship between our metric and multi-source learning performance. Finally, we provide a privacy-preserving method to calculate our metric. Our metric can provide valuable insights for deep learning practitioners using multi-source datasets.
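To make the keywords concrete: the abstract describes a bounded, label-aware, optimal-transport-style dataset similarity score. The sketch below is a minimal toy illustration of that general idea, not the paper's actual metric (whose definition is not given here). It combines a feature ground cost with a hypothetical label-mismatch penalty, solves the transport problem exactly (with uniform weights and equal sample sizes, optimal transport reduces to an assignment problem), and normalizes the cost into [0, 1]; the function name and `label_weight` parameter are assumptions for illustration.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def bounded_dataset_distance(X1, y1, X2, y2, label_weight=1.0):
    """Toy bounded, label-aware dataset distance (illustrative only).

    Assumes equal sample sizes and uniform weights, so the optimal
    transport plan is an assignment. Returns a value in [0, 1]:
    0 means identical datasets, larger means less similar.
    """
    assert len(X1) == len(X2), "equal sizes assumed for the assignment solver"
    C = cdist(X1, X2)                                  # feature ground cost
    C = C + label_weight * (y1[:, None] != y2[None, :])  # label-mismatch penalty
    rows, cols = linear_sum_assignment(C)              # exact OT plan here
    cost = C[rows, cols].mean()                        # average transport cost
    return cost / max(C.max(), 1e-12)                  # bound into [0, 1]
```

For example, a dataset compared against itself scores 0, while a shifted copy of the same data scores strictly higher; because the score is normalized by the largest pairwise cost, it stays bounded regardless of feature scale, echoing the boundedness property the abstract emphasizes.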
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
Supplementary Material: pdf
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 6108