Sum-of-Parts: Self-Attributing Neural Networks with End-to-End Learning of Feature Groups

Published: 01 May 2025, Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: We develop Sum-of-Parts (SOP), a framework that transforms any differentiable model into a self-attributing neural network whose predictions can be attributed to groups of features.
Abstract:

Self-attributing neural networks (SANNs) present a potential path towards interpretable models for high-dimensional problems, but often face significant trade-offs in performance. In this work, we formally prove a lower bound on the error of per-feature SANNs, and show that group-based SANNs can achieve zero error and thus high performance. Motivated by these insights, we propose Sum-of-Parts (SOP), a framework that transforms any differentiable model into a group-based SANN, where feature groups are learned end-to-end without group supervision. SOP achieves state-of-the-art performance for SANNs on vision and language tasks, and we validate that the groups are interpretable on a range of quantitative and semantic metrics. We further validate the utility of SOP explanations in model debugging and cosmological scientific discovery.

Lay Summary:

Machine learning models are incredibly powerful, but often work like "black boxes" --- we can't understand how they make decisions. This makes it hard to trust them in important applications like medical diagnosis or scientific research.

We developed a new approach that combines the power of deep learning with interpretability. Our method breaks down the input into meaningful groups of features and uses a powerful deep learning model to analyze each group separately. The per-group predictions are then combined using simple linear weights, so we can see exactly how much each group contributes to the final decision.
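To make this concrete, here is a minimal PyTorch-style sketch of a group-based self-attributing model in the spirit described above. The class name, the soft-mask grouping mechanism, and the backbone interface are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn


class SumOfPartsSketch(nn.Module):
    """Illustrative sketch (not the paper's code) of a group-based
    self-attributing network: learned soft masks assign features to groups,
    a shared backbone scores each masked copy of the input, and a linear
    combination of per-group scores yields the prediction, so each group's
    contribution to the output is explicit."""

    def __init__(self, backbone: nn.Module, num_features: int, num_groups: int):
        super().__init__()
        self.backbone = backbone  # any differentiable model mapping (batch, d) -> (batch, 1)
        # One soft mask over the d features per group, learned end-to-end.
        self.mask_logits = nn.Parameter(torch.randn(num_groups, num_features))
        # Linear weights that combine per-group scores into the final prediction.
        self.group_weight = nn.Parameter(torch.ones(num_groups))

    def forward(self, x: torch.Tensor):
        # x: (batch, num_features)
        masks = torch.sigmoid(self.mask_logits)          # (groups, d), values in [0, 1]
        masked = x.unsqueeze(1) * masks                  # (batch, groups, d)
        scores = self.backbone(masked.flatten(0, 1))     # (batch * groups, 1)
        scores = scores.view(x.size(0), -1)              # (batch, groups)
        contributions = scores * self.group_weight       # per-group contribution to the output
        prediction = contributions.sum(dim=-1)           # final prediction
        return prediction, contributions, masks          # contributions serve as the attribution
```

In this sketch, reading off `contributions` for an input directly tells you how much each learned feature group added to the prediction, which is the self-attribution property the lay summary describes.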

This framework automatically learns which features should be grouped together during training, and it works with any existing pretrained model. When we tested it on cosmology data, our method achieved the best performance among interpretable models that explicitly show which input features they rely on. The learned feature groups corresponded to meaningful physical concepts that helped cosmologists gain new scientific insights about the universe.

Primary Area: Social Aspects->Accountability, Transparency, and Interpretability
Keywords: explainability, interpretability, faithfulness, self-explaining models, feature attribution
Submission Number: 3311