Building Bridges, Not Walls: Advancing Interpretability by Unifying Feature, Data, and Model Component Attribution
Keywords: Interpretability, Attribution, Explainability
Abstract: The increasing complexity of AI systems has made understanding their behavior a critical challenge, especially for foundation models. Numerous methods have been developed to attribute model behavior to three key aspects: input features, training data, and internal model components. However, these attribution methods have been studied and applied largely independently, resulting in a fragmented landscape of approaches and terminology. This position paper argues that feature, data, and component attribution methods share fundamental similarities, and that bridging them can benefit interpretability research. We conduct a detailed analysis of successful methods across these three types of attribution and present a unified view demonstrating that these seemingly distinct methods rely on similar techniques, such as perturbations, gradients, and linear approximations, and differ primarily in their perspectives rather than their core mechanics. Our unified perspective enhances understanding of existing attribution methods, identifies shared concepts and challenges, makes the field more accessible to newcomers, and highlights new directions not only for attribution and interpretability but also for broader AI research, including model editing, steering, and regulation. Ultimately, it facilitates research on foundation models.
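As an illustrative sketch only (not code from the paper): the snippet below uses a hypothetical toy PyTorch model and the gradient-times-input rule to show how the same core operation, a gradient, can serve both feature attribution (derivative with respect to the input) and component attribution (derivative with respect to an internal activation), differing only in where the derivative is taken.

```python
# Illustrative sketch under stated assumptions: a toy two-layer network and the
# gradient-times-input attribution rule. The point is that feature and component
# attribution share the same core operation (a gradient) and differ only in
# whether it is taken with respect to the input or an internal activation.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))

x = torch.randn(1, 4, requires_grad=True)

# Keep the intermediate activation so it can also receive attributions ("component").
hidden = model[1](model[0](x))
hidden.retain_grad()
output = model[2](hidden)

# Backpropagate the scalar model output once; both gradients come from the same pass.
output.sum().backward()

# Feature attribution: gradient w.r.t. the input, scaled by the input.
feature_attr = (x.grad * x).detach()

# Component attribution: the same rule applied to an internal activation instead.
component_attr = (hidden.grad * hidden).detach()

print("feature attribution:  ", feature_attr)
print("component attribution:", component_attr)
```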
Submission Number: 63