E pluribus unum interpretable convolutional neural networks

Published: 23 Jun 2025, Last Modified: 23 Jun 2025 · Greeks in AI 2025 Poster · CC BY 4.0
Keywords: Vision and Learning
TL;DR: We propose a framework for developing inherently interpretable Convolutional Neural Network models based on Generalized Additive Models.
Abstract: Convolutional neural networks (CNNs) have achieved remarkable predictive performance across a wide range of computer vision tasks, yet their widespread deployment in safety-critical domains remains hampered by a lack of transparency in how individual image features drive classification decisions. In this work, we introduce the E pluribus unum interpretable CNN (EPU-CNN) [1], a framework that makes any base CNN architecture inherently interpretable by offering human-aligned explanations grounded in perceptual features. Drawing inspiration from Generalized Additive Models (GAMs), an EPU-CNN decomposes the input image into a set of orthogonal Perceptual Feature Maps (PFMs), e.g., color opponency channels (blue-yellow, green-red) and texture representations (light-dark, coarse-fine), and processes each PFM through a parallel CNN sub-network. The final prediction of an EPU-CNN model is obtained by summing the outputs of the sub-networks, which serve as Relative Similarity Scores (RSSs) that directly quantify the contribution of each perceptual feature to the inferred decision. To visualize where in the image each perceptual feature contributes most, we offer spatially resolved Perceptual Relevance Maps (PRMs), derived by back-projecting feature activations, which guide the user's attention and offer important insights into the decision-making process of the model. EPU-CNN was validated across various benchmark datasets: a custom “Banapple” dataset, proposed to assess the interpretability of any feature-attribution-based method by separating the color- and shape-based features of target objects; three endoscopic repositories (KID, Kvasir, EndoVis2015); the ISIC 2019 dermoscopic collection; and four standard vision datasets (CIFAR-10, MNIST, Fashion-MNIST, iBean). EPU-CNN variants built on lightweight CNN architectures, as well as on VGG16, ResNet50, and DenseNet169, match or exceed the classification accuracy of their non-interpretable counterparts while offering immediate, quantitative interpretability. Including all four PFMs yields the best performance on color- and texture-sensitive tasks, whereas chromatic features alone suffice for gastrointestinal lesion detection. Qualitative analysis of PRMs and global/local RSS bar charts reveals that EPU-CNN reliably highlights clinically relevant tissue variations (e.g., blood in endoscopy or asymmetries in dermoscopy), and that synthetic perturbations in shape or color lead to coherent shifts in the model's explanations. Finally, using the ROAD metric, a well-established measure for evaluating interpretability, we compared EPU-CNN against Grad-CAM and a Grad-CAM-based ensemble, showing that EPU-CNN delivers more faithful interpretations. By embedding human-perceivable, opponent-feature representations directly into the architecture of a CNN, EPU-CNN bridges the gap between predictive power and explanatory transparency. Its modular, architecture-agnostic design facilitates rapid adoption in domains requiring both high accuracy and regulatory compliance, such as medical imaging, agriculture, and food quality assessment. Future work will explore automated PFM discovery and user-centered evaluation with domain experts to further refine and extend this framework.
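To make the additive design concrete, below is a minimal PyTorch sketch of the idea, not the reference implementation of [1]: the opponency formulas, the toy `SubNet` backbone, and the `perceptual_relevance_map` helper are illustrative assumptions, whereas the paper supports arbitrary base CNNs and derives PRMs by back-projecting feature activations.

```python
# Minimal, illustrative sketch of an EPU-CNN-style additive model.
# The PFM formulas and sub-network below are assumptions for exposition,
# not the exact components used in [1].
import torch
import torch.nn as nn
import torch.nn.functional as F

def perceptual_feature_maps(rgb: torch.Tensor) -> list[torch.Tensor]:
    """Decompose a (B, 3, H, W) RGB batch into four single-channel PFMs:
    green-red and blue-yellow opponency, light-dark (luminance), and a
    coarse-fine texture residual. Formulas here are illustrative guesses."""
    r, g, b = rgb[:, 0:1], rgb[:, 1:2], rgb[:, 2:3]
    green_red = g - r                                  # chromatic opponency
    blue_yellow = b - (r + g) / 2                      # chromatic opponency
    light_dark = 0.299 * r + 0.587 * g + 0.114 * b     # luminance
    blurred = F.avg_pool2d(light_dark, kernel_size=5, stride=1, padding=2)
    coarse_fine = light_dark - blurred                 # high-frequency (texture) proxy
    return [green_red, blue_yellow, light_dark, coarse_fine]

class SubNet(nn.Module):
    """Toy per-PFM sub-network; in [1] any base CNN (e.g., VGG16, ResNet50,
    DenseNet169) can fill this role. Emits one scalar RSS per input."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(32, 1)

    def forward(self, pfm: torch.Tensor) -> torch.Tensor:
        return self.head(self.features(pfm).flatten(1))   # (B, 1) RSS

class EPUCNN(nn.Module):
    """GAM-style additive combination: logit = bias + sum_i RSS_i(PFM_i)."""
    def __init__(self, n_pfms: int = 4):
        super().__init__()
        self.subnets = nn.ModuleList(SubNet() for _ in range(n_pfms))
        self.bias = nn.Parameter(torch.zeros(1))

    def forward(self, rgb: torch.Tensor):
        rss = [net(pfm) for net, pfm in
               zip(self.subnets, perceptual_feature_maps(rgb))]
        logit = self.bias + torch.stack(rss).sum(dim=0)    # (B, 1)
        return logit, rss  # rss doubles as the per-feature explanation

def perceptual_relevance_map(subnet: SubNet, pfm: torch.Tensor) -> torch.Tensor:
    """Crude stand-in for the PRM back-projection of [1]: channel-averaged
    last-conv activations upsampled to the input resolution."""
    acts = subnet.features[:-1](pfm)          # feature maps before global pooling
    prm = acts.mean(dim=1, keepdim=True)      # (B, 1, h, w)
    return F.interpolate(prm, size=pfm.shape[-2:], mode="bilinear",
                         align_corners=False)
```

For a binary task, `torch.sigmoid(logit)` gives the class probability, and the signed RSS values returned alongside it can be plotted directly as the local bar-chart explanation described above.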
Reference: [1] Dimas, G., Cholopoulou, E., & Iakovidis, D. K. (2023). E pluribus unum interpretable convolutional neural networks. Scientific Reports, 13(1), 11421. https://www.nature.com/articles/s41598-023-38459-1
Submission Number: 81