Keywords: data attribution, influence functions, LLMs, interpretability
TL;DR: We efficiently scale influence-function-based training data attribution to recent LLMs and their massive training datasets.
Abstract: Training data attribution (TDA) quantifies the contribution of individual training examples to model predictions, enabling a range of applications such as data curation, data citation, and model debugging. However, applying existing TDA methods to recent large models and training datasets has been largely limited by prohibitive compute and memory costs. In this work, we focus on influence functions, a popular gradient-based TDA method, and significantly improve their scalability with an efficient gradient projection strategy called LoGra that leverages the gradient structure in backpropagation. We then provide theoretical motivation for gradient projection approaches to influence functions, promoting trust in the TDA process. Lastly, we lower the barrier to implementing TDA systems by introducing LogIX, a software package that can transform existing training code into TDA code with minimal effort. In our TDA experiments, LoGra achieves accuracy competitive with more expensive baselines while showing up to a 6,500x improvement in throughput and a 5x reduction in GPU memory usage when applied to Llama3-8B-Instruct and a 1B-token dataset.
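To make the gradient-projection idea concrete, below is a minimal, self-contained sketch of influence estimation with low-rank gradient projection, in the spirit of the approach the abstract describes. It is not the paper's LoGra algorithm or the LogIX API; all names (the toy linear layer, projected_grad, the rank k, the damping lam) are illustrative assumptions. It relies on the fact that for a linear layer with weight W, backpropagation yields the per-example gradient dL/dW = g aᵀ, where a is the layer input and g is the gradient with respect to the layer output, so projecting a and g separately compresses the gradient without ever materializing it in full.

```python
# Hypothetical sketch of influence scores with projected gradients.
# Assumptions (not from the paper): a single bias-free linear layer,
# random Gaussian projections, squared loss, Gauss-Newton-style Hessian.
import torch
import torch.nn as nn

torch.manual_seed(0)
d_in, d_out, k = 64, 32, 4  # layer sizes and projection rank (assumed)

layer = nn.Linear(d_in, d_out, bias=False)
P_a = torch.randn(k, d_in) / d_in ** 0.5    # projects layer inputs a
P_g = torch.randn(k, d_out) / d_out ** 0.5  # projects output gradients g

def projected_grad(x, y):
    """Return the k*k projected gradient of the squared loss for one example."""
    a = x
    out = layer(a)
    loss = 0.5 * ((out - y) ** 2).sum()
    g = torch.autograd.grad(loss, out)[0]   # gradient w.r.t. layer output
    # The full per-example gradient is torch.outer(g, a) (d_out x d_in);
    # the projected version (P_g g) outer (P_a a) is only k x k.
    return torch.outer(P_g @ g, P_a @ a).reshape(-1)

# Build a damped inverse "Hessian" in the projected space from training
# gradients (a Gauss-Newton-style approximation; lam is a damping term).
train_data = [(torch.randn(d_in), torch.randn(d_out)) for _ in range(128)]
G = torch.stack([projected_grad(x, y) for x, y in train_data])  # (N, k*k)
lam = 1e-3
H = G.T @ G / len(train_data) + lam * torch.eye(k * k)

# Influence of training example i on a query prediction:
#   score_i = grad(query)^T H^{-1} grad(train_i)
x_q, y_q = torch.randn(d_in), torch.randn(d_out)
q = projected_grad(x_q, y_q)
scores = G @ torch.linalg.solve(H, q)  # (N,) attribution scores
print(scores.topk(3).indices)          # most influential training examples
```

Note how the projection shrinks both the gradient storage (k*k instead of d_out*d_in per example) and the linear solve (a (k*k)-dimensional system), which is the source of the throughput and memory gains the abstract reports; the exact projection structure used by LoGra is detailed in the paper itself.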
Supplementary Material: zip
Primary Area: interpretability and explainable AI
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 9268