Building Bridges, Not Walls: Advancing Interpretability by Unifying Feature, Data, and Model Component Attribution
Track: Long Paper Track (up to 9 pages)
Keywords: Interpretability, Attribution, Explainability
Abstract: The increasing complexity of AI systems has made understanding their behavior and building trust in them a critical challenge, especially for large language models. Numerous methods have been developed to attribute model behavior to three key aspects: input features, training data, and internal model components. However, these attribution methods have been studied and applied largely independently, resulting in a fragmented landscape of approaches and terminology. We argue that feature, data, and component attribution methods share fundamental similarities, and that bridging them can benefit interpretability research. We conduct a detailed analysis of successful methods across these three types of attribution and present a unified view demonstrating that they employ similar techniques: perturbations, gradients, and linear approximations. Our unified view enhances understanding of attribution methods and highlights new directions for interpretability and broader AI areas, including model editing, steering, and regulation.
Submission Number: 36