What is Important? Internal Interpretability of Models Processing Data with Inherent Structure

20 Sept 2025 (modified: 03 Dec 2025), ICLR 2026 Conference Withdrawn Submission, CC BY 4.0
Keywords: deep learning, interpretability by design, saliency maps, data with inherent structure
TL;DR: This paper presents a method for building interpretable neural networks that quantify the importance of input components internally; a two-stage training process yields accurate, stable models with causally grounded explanations.
Abstract: This paper introduces a methodology for constructing interpretable neural networks that quantify the importance of structured input components directly within their internal mechanisms, thereby eliminating the need for traditional explanation methods that rely on post-hoc saliency map generation. Our approach features a two-stage training procedure. First, component-specific representations and importance scores are discovered using appropriately designed convolutional neural networks, which are trained jointly. Second, an architecture with relaxed structural constraints, leveraging the previously acquired knowledge, is fine-tuned to capture spatial dependencies among components and to integrate global context. We systematically evaluate our method on Oxford Pets, Stanford Cars, CUB-200, Imagenette, and ImageNet, measuring interpretability-performance trade-offs with metrics for semanticity, sparsity, reproducibility, and, when required, causality (via insertion/deletion-inspired scores). Our architecture achieves better semantic alignment with ground-truth segmentation annotations, which, when available, serve as surrogates for expected saliency maps, than post-hoc saliency methods do. At the same time, it maintains low variance in importance scores across runs, demonstrating strong reproducibility. Crucially, our architecture provides interpretability gains without sacrificing accuracy: with both non-pretrained and pretrained backbones, it frequently achieves higher predictive performance than parameter-matched baselines. Overall, compared to both conventional models and post-hoc interpretability techniques under matched computational budgets, our framework produces models that are accurate, stable, and capable of delivering causally grounded explanations.
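For readers unfamiliar with insertion/deletion-inspired causality scores of the kind the abstract mentions, the sketch below illustrates the general idea of a deletion-style check; it is a minimal, generic example, not the authors' exact metric, and the function name, step count, and baseline value are assumptions chosen for illustration.

```python
import torch

def deletion_score(model, image, saliency, target_class, steps=20, baseline=0.0):
    """Deletion-style causality check (hypothetical sketch).

    Progressively masks out the pixels ranked most important by the
    saliency/importance map and tracks the drop in the model's confidence
    for the target class. A faster drop (lower area under the deletion
    curve) suggests a more causally faithful importance map.
    """
    model.eval()
    c, h, w = image.shape
    order = saliency.flatten().argsort(descending=True)  # most important pixels first
    per_step = max(1, order.numel() // steps)
    probs = []
    current = image.clone()
    with torch.no_grad():
        for i in range(steps + 1):
            p = torch.softmax(model(current.unsqueeze(0)), dim=1)[0, target_class]
            probs.append(p.item())
            # zero out the next batch of most-important pixels
            idx = order[i * per_step:(i + 1) * per_step]
            mask = torch.zeros(h * w, dtype=torch.bool)
            mask[idx] = True
            current = current.masked_fill(mask.view(1, h, w), baseline)
    return sum(probs) / len(probs)  # approximate area under the deletion curve
```

An insertion-style score mirrors this procedure, starting from a blurred or blank image and re-inserting the most important pixels first, where a faster rise in confidence indicates a better explanation.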
Supplementary Material: zip
Primary Area: interpretability and explainable AI
Submission Number: 23901