Fast Training Dataset Attribution via In-Context Learning

Published: 18 Jun 2024, Last Modified: 20 Jul 2024
ICML 2024 Workshop ICL Poster
License: CC BY 4.0
Track: short paper (up to 4 pages)
Keywords: In-Context Learning, Training Data Attribution, Prompt Engineering, Matrix Factorization
TL;DR: Training Dataset Attribution is much faster with our in-context learning algorithms.
Abstract: We investigate the use of in-context learning and prompt engineering to estimate the contributions of training data to the outputs of instruction-tuned large language models (LLMs). We propose two novel approaches: (1) a similarity-based approach that measures the difference between LLM outputs with and without the provided context, and (2) a mixture distribution model approach that frames the estimation of contribution scores as a matrix factorization task. Our empirical comparison demonstrates that the mixture model approach is more robust to retrieval noise in in-context learning, providing a more reliable estimation of data contributions.
Submission Number: 24
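The abstract's second approach frames contribution-score estimation as a matrix factorization problem. The sketch below is an illustrative, hypothetical rendering of that framing, not the authors' code: it assumes a non-negative matrix `similarity` of shape (num_queries, num_candidate_docs) holding scores that compare LLM outputs with and without each candidate document in context, and applies a rank-k non-negative matrix factorization to obtain per-document contribution weights. All names and parameters are assumptions for illustration.

```python
# Hypothetical sketch of the matrix-factorization framing described in the
# abstract; variable names and the NMF setup are assumptions, not the paper's.
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
# Placeholder (num_queries x num_candidate_docs) non-negative similarity matrix.
similarity = rng.random((8, 5))

# Rank-2 non-negative factorization: similarity ~= W @ H.
model = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=500)
W = model.fit_transform(similarity)   # query-by-component mixture weights
H = model.components_                 # component-by-document loadings

# Aggregate into a per-document contribution score and normalize to sum to 1.
contribution = W.sum(axis=0) @ H
contribution /= contribution.sum()
print(contribution)
```

In this toy setup, documents with larger normalized scores are interpreted as contributing more to the model's in-context outputs; how the similarity matrix is built and how scores are aggregated would follow the paper's actual method.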