Disentangling the Roles of Representation and Selection in Data Pruning (for Fine-Tuning)

24 Sept 2024 (modified: 22 Nov 2024), ICLR 2025 Conference Withdrawn Submission, CC BY 4.0
Keywords: data pruning, fine-tuning
TL;DR: We disentangle and study the distinct roles of data representations and selection algorithms in data pruning.
Abstract: Data pruning, the process of carefully selecting a small subset of training data, has been shown to improve both training efficiency and performance. It typically involves two steps: (1) obtaining a representation for each instance, and (2) applying a selection algorithm using these representations. However, the distinct roles of these two steps, as well as their interactions, remain unclear. To address this, we conduct a systematic study of data pruning, focusing on NLP fine-tuning. Our theoretical and empirical findings reveal that data representation often plays a more fundamental role than the selection algorithm: gradients, despite being computationally expensive, provide stronger pruning signals than other representations, making gradient-based methods consistently outperform cheaper alternatives. We also demonstrate that different selection algorithms excel in specific scenarios but are heavily influenced by the chosen representation. These insights provide clear guidelines for future research and practical applications.
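The two-step structure described in the abstract can be sketched as a small pipeline in which the representation and the selection algorithm are independent, swappable components. This is only an illustrative sketch, not the authors' implementation: the function names (`gradient_representation`, `feature_representation`, `k_center_select`, `prune`), the logistic-regression model, and the greedy k-center rule are all assumptions chosen for brevity.

```python
import math

def gradient_representation(x, y, w):
    """Step 1 (assumed example): per-instance gradient of the logistic loss w.r.t. w."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    p = 1.0 / (1.0 + math.exp(-z))          # predicted probability of label 1
    return [(p - y) * xi for xi in x]

def feature_representation(x, y, w):
    """Step 1 (cheaper assumed alternative): use the raw features as the representation."""
    return list(x)

def k_center_select(reps, k):
    """Step 2 (assumed example): greedy k-center selection over the representations."""
    n = len(reps)
    norms = [math.hypot(*r) for r in reps]
    first = max(range(n), key=norms.__getitem__)      # seed with the largest-norm instance
    selected = [first]
    nearest = [math.dist(r, reps[first]) for r in reps]
    while len(selected) < min(k, n):
        nxt = max(range(n), key=nearest.__getitem__)  # farthest from the current selection
        selected.append(nxt)
        nearest = [min(d, math.dist(r, reps[nxt])) for d, r in zip(nearest, reps)]
    return selected

def prune(data, labels, w, k,
          represent=gradient_representation, select=k_center_select):
    """Run both steps: represent every instance, then select k indices to keep."""
    reps = [represent(x, y, w) for x, y in zip(data, labels)]
    return select(reps, k)

# Toy usage: same selection algorithm, two different representations.
data = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0], [-1.0, -1.0], [0.5, 0.5]]
labels = [1, 0, 1, 0, 1]
w = [0.0, 0.0]
kept_by_gradients = prune(data, labels, w, k=2)
kept_by_features = prune(data, labels, w, k=2, represent=feature_representation)
```

Keeping the selection algorithm fixed while swapping the representation (or the reverse) is what makes it possible to attribute a change in the pruned subset to one step rather than the other. On this toy data the two representations yield different subsets under the same selection rule.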
Supplementary Material: zip
Primary Area: other topics in machine learning (i.e., none of the above)
Submission Number: 3587