Disentangling the Roles of Representation and Selection in Data Pruning (for Fine-Tuning)

Yupei Du; Yingjin Song; Hugh Mee Wong; Daniil Ignatev; Albert Gatt; Dong Nguyen

24 Sept 2024 (modified: 22 Nov 2024). ICLR 2025 Conference Withdrawn Submission. License: CC BY 4.0

Keywords: data pruning, fine-tuning

TL;DR: We disentangle and study the distinct roles of data representations and selection algorithms in data pruning.

Abstract: Data pruning, the process of carefully selecting a small subset of training data, has been shown to improve both training efficiency and performance. It typically involves two steps: (1) obtaining a representation for each instance, and (2) applying a selection algorithm using these representations. However, the distinct roles of these two steps, as well as their interactions, remain unclear. To address this, we conduct a systematic study of data pruning, focusing on NLP fine-tuning. Our theoretical and empirical findings reveal that data representation often plays a more fundamental role than the selection algorithm: gradients, despite being computationally expensive, provide stronger pruning signals than other representations, making gradient-based methods consistently outperform cheaper alternatives. We also demonstrate that different selection algorithms excel in specific scenarios but are heavily influenced by the chosen representation. These insights provide clear guidelines for future research and practical applications.
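The two-step structure described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `represent`, `select_k_center`, and `prune` are hypothetical, the random vectors merely stand in for real representations (such as embeddings or per-example gradients), and greedy k-center is used only as one familiar example of a selection algorithm. It assumes the budget does not exceed the number of instances.

```python
import numpy as np


def represent(instances, dim=8, seed=0):
    """Step 1 (placeholder): map each instance to a vector.

    Random vectors stand in for real representations such as
    embeddings or per-example gradients; this is an assumption
    made purely for illustration.
    """
    rng = np.random.default_rng(seed)
    return rng.normal(size=(len(instances), dim))


def select_k_center(reps, budget):
    """Step 2: greedy k-center selection over the representations.

    Starts from index 0 and repeatedly adds the point that is
    farthest from everything chosen so far.
    """
    chosen = [0]
    # Distance from every point to its nearest chosen point.
    dists = np.linalg.norm(reps - reps[0], axis=1)
    while len(chosen) < budget:
        nxt = int(np.argmax(dists))
        chosen.append(nxt)
        dists = np.minimum(dists, np.linalg.norm(reps - reps[nxt], axis=1))
    return chosen


def prune(instances, budget, represent_fn=represent, select_fn=select_k_center):
    """Compose the two steps: represent every instance, then select a subset."""
    reps = represent_fn(instances)
    keep = select_fn(reps, budget)
    return [instances[i] for i in keep]


if __name__ == "__main__":
    data = [f"example {i}" for i in range(20)]
    print(prune(data, budget=5))
```

Because `represent_fn` and `select_fn` are passed in separately, either step can be swapped while the other is held fixed, which is the kind of controlled comparison the abstract describes.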

Supplementary Material: zip

Primary Area: other topics in machine learning (i.e., none of the above)

Submission Number: 3587
