RepMatch: Quantifying Cross-Instance Similarities in Representation Space

ACL ARR 2024 June Submission 1665 Authors

14 Jun 2024 (modified: 04 Jul 2024) · ACL ARR 2024 June Submission · CC BY 4.0
Abstract: Advances in dataset analysis have produced methods for categorizing training data instances according to specific features such as "difficulty". We propose a framework that instead categorizes data from the viewpoint of similarity. The framework quantifies the similarity between subsets of training instances by comparing the models trained on them. This addresses a limitation of existing methodologies, which focus on individual instances and are confined to single-dataset analyses. Because our method evaluates similarity between arbitrary subsets of instances, it supports both dataset-dataset and instance-dataset analyses. To compare two models efficiently, we leverage the Low-Rank Adaptation (LoRA) method. We validate the effectiveness of our method across various NLP tasks, datasets, and models: it can be used to compare datasets, to find a small subset that outperforms a randomly selected subset of the same size, and to uncover heuristics used in the construction of a challenge dataset.
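The abstract does not specify how two LoRA-adapted models are compared, so the following is only a minimal sketch of one plausible approach: fine-tune a LoRA adapter per training subset, then score subset similarity as the cosine similarity between the flattened low-rank weight updates. All function names (`lora_delta`, `subset_similarity`) and the choice of cosine similarity are illustrative assumptions, not the paper's actual metric.

```python
import numpy as np

def lora_delta(A, B):
    # LoRA parameterizes a weight update as a low-rank product B @ A,
    # where A is (r, k) and B is (d, r) with rank r << min(d, k).
    return B @ A

def subset_similarity(adapters_1, adapters_2):
    # Hypothetical similarity score: cosine similarity between the
    # concatenated, flattened LoRA updates of two models, each
    # fine-tuned on a different subset of training instances.
    v1 = np.concatenate([lora_delta(A, B).ravel() for A, B in adapters_1])
    v2 = np.concatenate([lora_delta(A, B).ravel() for A, B in adapters_2])
    return float(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))

rng = np.random.default_rng(0)
d, k, r = 16, 12, 4
# Two toy "models": each a list of (A, B) LoRA factors, one pair per adapted layer.
m1 = [(rng.normal(size=(r, k)), rng.normal(size=(d, r))) for _ in range(2)]
m2 = [(rng.normal(size=(r, k)), rng.normal(size=(d, r))) for _ in range(2)]

print(subset_similarity(m1, m1))  # identical adapters -> 1.0
print(subset_similarity(m1, m2))  # unrelated random adapters -> near 0
```

Comparing adapters rather than full fine-tuned checkpoints keeps the comparison cheap, since only the low-rank factors (a tiny fraction of the base model's parameters) need to be stored and diffed.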
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: Dataset Analysis, Instance selection, Out-of-distribution
Contribution Types: Approaches to low-resource settings, Data analysis
Languages Studied: English
Submission Number: 1665