Building Bridges, Not Walls: Advancing Interpretability by Unifying Feature, Data, and Model Component Attribution
Track: Long Paper Track (up to 9 pages)
Keywords: Interpretability, Attribution, Explainability
Abstract: The increasing complexity of AI systems has made understanding their behavior and building trust in them a critical challenge, especially for large language models. Numerous methods have been developed to attribute model behavior to three key aspects: input features, training data, and internal model components. However, these attribution methods have been studied and applied largely independently, resulting in a fragmented landscape of approaches and terminology. We argue that feature, data, and component attribution methods share fundamental similarities, and that bridging them can benefit interpretability research. We conduct a detailed analysis of successful methods across these three types of attribution and present a unified view demonstrating that they employ similar techniques: perturbations, gradients, and linear approximations. Our unified view enhances understanding of attribution methods and highlights new directions for interpretability and broader AI areas, including model editing, steering, and regulation.
Submission Number: 36