Fast Training Dataset Attribution via In-Context Learning

ACL ARR 2025 May Submission 456 Authors

12 May 2025 (modified: 03 Jul 2025) · ACL ARR 2025 May Submission · CC BY 4.0
Abstract: We investigate the use of in-context learning and prompt engineering to estimate the contributions of training data to the outputs of instruction-tuned large language models (LLMs). We propose two novel approaches: (1) a similarity-based approach that measures the difference between LLM outputs with and without provided context, and (2) a mixture distribution model approach that frames the problem of identifying contribution scores as a matrix factorization task. Our empirical comparison demonstrates that the mixture model approach is more robust to retrieval noise in in-context learning, providing a more reliable estimation of data contributions.
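The mixture-model framing described in the abstract can be illustrated with a minimal sketch: if the model's output distribution is treated as a convex combination of per-source distributions, the contribution scores are the mixture weights recovered by a constrained factorization. All names, dimensions, and the use of nonnegative least squares here are illustrative assumptions, not the paper's actual method.

```python
import numpy as np
from scipy.optimize import nnls

# Hypothetical setup: Q is a V x k matrix whose columns are per-source
# output distributions over a vocabulary of size V; the observed output
# distribution p is assumed to be a mixture p ≈ Q @ w with weights w >= 0.
rng = np.random.default_rng(0)
V, k = 50, 3                                # vocabulary size, number of sources
Q = rng.dirichlet(np.ones(V), size=k).T     # columns sum to 1
w_true = np.array([0.6, 0.3, 0.1])          # ground-truth contribution scores
p = Q @ w_true                              # observed mixture distribution

# Recover contribution scores via nonnegative least squares, then
# renormalize so they sum to 1.
w_hat, _ = nnls(Q, p)
w_hat = w_hat / w_hat.sum()
```

In this noiseless toy example the recovered `w_hat` matches `w_true`; in practice the factorization would be solved over noisy, retrieval-dependent distributions, which is where the abstract's robustness claim applies.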
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: Data influence
Contribution Types: Model analysis & interpretability, Approaches to low-compute settings-efficiency, Data analysis
Languages Studied: English
Submission Number: 456