Accountability Attribution: Tracing Model Behavior to Training Processes

Published: 01 Jul 2025, Last Modified: 01 Jul 2025 · ICML 2025 R2-FM Workshop Poster · CC BY 4.0
Keywords: Accountability, Responsibility, Interpretability, Trustworthy AI, Attribution, Learning Dynamics
Abstract: As foundation models (FMs) advance toward artificial general intelligence with increasingly complex training pipelines (pretraining, fine-tuning rounds, and subsequent adaptation or alignment), ensuring their reliable and responsible development becomes critical. A key challenge in this context is accountability: when a deployed model exhibits concerning or beneficial behaviors, which training stage is responsible, and to what extent? We pose the problem of \textbf{accountability attribution}, which aims to trace model behavior back to specific stages of the training process, enabling the transparency and auditability essential for responsible AI development. To address this, we propose a general framework that answers counterfactual questions about stage effects: \textit{how would the model's behavior have changed if the updates from a training stage had not been executed?} Within this framework, we introduce estimators based on first-order approximations that efficiently quantify stage effects without retraining. Our estimators account for both the training data and key aspects of optimization dynamics, including learning rate schedules, momentum, and weight decay. Through experiments in high-stakes domains such as medicine and finance, we demonstrate that our approach identifies the training stages accountable for specific behaviors, offering a practical tool for ensuring reliable and responsible foundation model development.
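
The counterfactual question above admits a simple first-order reading: the effect of a stage can be approximated by the inner product between the gradient of a behavior metric at the final parameters and the net parameter update that stage contributed, i.e. $f(\theta_T) - f(\theta_T - \Delta\theta_k) \approx \nabla f(\theta_T)^\top \Delta\theta_k$. The sketch below illustrates that idea only; it is a hypothetical minimal example, not the paper's released code, and the function name `stage_effect_first_order` and the toy data are assumptions.

```python
import numpy as np

def stage_effect_first_order(behavior_grad, stage_update):
    """First-order estimate of f(theta_final) - f(theta_final - stage_update).

    behavior_grad: gradient of the scalar behavior metric at the final parameters.
    stage_update:  net parameter change contributed by the stage (assumed to
                   already reflect learning rate, momentum, and weight decay).
    """
    return float(np.dot(behavior_grad, stage_update))

# Toy usage: three hypothetical training stages over a 5-dimensional parameter vector.
rng = np.random.default_rng(0)
behavior_grad = rng.normal(size=5)
stage_updates = [rng.normal(scale=s, size=5) for s in (0.5, 1.0, 0.1)]

effects = [stage_effect_first_order(behavior_grad, u) for u in stage_updates]
most_accountable = int(np.argmax(np.abs(effects)))
print(f"Per-stage effect estimates: {effects}")
print(f"Stage with the largest estimated effect: {most_accountable}")
```

In this toy setting, the stage whose accumulated update aligns most strongly with the behavior gradient receives the largest attribution, which is the intuition the abstract's retraining-free estimators formalize.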
Submission Number: 89