The Relevancy Metric: Understanding the Impact of Training Data

Manish Nagaraj; Deepak Ravikumar; Efstathia Soufleri; Kaushik Roy

27 Sept 2024 (modified: 05 Feb 2025). Submitted to ICLR 2025. License: CC BY 4.0

Keywords: Train-Test Relationship, Influence functions, Memorization, Learning dynamics, Dataset properties

TL;DR: The paper presents a novel metric, $\textbf{\textit{Relevancy}}$, that quantifies the impact of individual training samples on inference predictions in a scalable and computationally efficient way.

Abstract: Deep learning models are central to many critical decision-making processes, making it imperative to gain deeper insights into their behavior to improve performance, transparency, interpretability, and fairness. A key challenge is understanding how training data shapes model predictions on unseen test data. In this paper, we introduce a novel metric, $\textbf{\textit{Relevancy}}$, which quantifies the impact of individual training samples on inference predictions. Our proposed metric is calculated by observing the learning dynamics of the model during training, and it is computationally efficient and applicable across a wide range of tasks. We demonstrate that it is between $80\times$ and $100{,}000\times$ more efficient than existing metrics for capturing the train-test relationship. Using $\textit{relevancy}$, we enable the identification of coresets, compact datasets that represent the essence of the training distribution. Quantitative evaluations show that coresets selected using our metric outperform state-of-the-art methods by up to $5.2\%$ on CIFAR-100. Additionally, we qualitatively demonstrate how $\textit{relevancy}$ can be extended to assess various training data properties, such as identifying mislabeled samples in widely used datasets like ImageNet, CIFAR-100, and Fashion-MNIST. These examples illustrate just a few of the many potential uses of $\textit{relevancy}$, highlighting its versatility in promoting more interpretable, efficient, and fair deep learning systems across diverse tasks.

Primary Area: interpretability and explainable AI

Submission Number: 11325
