Multiplex network-based representation of vision transformers for visual explainability

Michele Marchetti, Davide Traini, Domenico Ursino, Luca Virgili

Published: 2025, Last Modified: 07 Nov 2025. Neural Comput. Appl. 2025. License: CC BY-SA 4.0

Abstract: The enormous growth of artificial intelligence (AI), and deep learning (DL) in particular, has led to the widespread use of these systems in a variety of contexts. One DL model capable of addressing complex computer vision tasks is the vision transformer (ViT). Despite its huge success, the reasoning behind the inferences it makes is often unclear, which poses significant challenges in critical scenarios. In this paper, we propose a new approach called MUltiplex Transformer EXplainer (MUTEX), which aims to explain the inferences made by ViTs. MUTEX combines multiplex network-based representations of attention matrices and mask perturbation approaches to provide insight into the inference process of ViTs. By mapping the attention layers of a ViT into a multiplex network, MUTEX is able to analyze the relationships between different parts of the input image and identify the image patches that most influence the inference process. We tested MUTEX on a subset of ImageNet and on BloodMNIST and compared its performance with that of existing visual explainability approaches. In addition, to assess the robustness and adaptability of MUTEX, we conducted a qualitative analysis, along with a hyperparameter study and an ablation study, which allowed us to further appreciate its potential for the visual explainability of ViTs.
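
The idea of mapping attention layers into a multiplex network can be illustrated with a minimal sketch. This is not the MUTEX algorithm itself, which the abstract does not specify in detail. The sketch assumes that each attention layer is given as a head-averaged square matrix over image patches, that each layer becomes one weighted directed graph over the same set of patch nodes, and that patches are ranked by the total attention they receive across layers. The function names `build_multiplex`, `patch_scores` and `top_k_mask`, the `threshold` parameter, and the scoring rule are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple

Matrix = List[List[float]]
Layer = Dict[Tuple[int, int], float]


def build_multiplex(attention_layers: List[Matrix], threshold: float = 0.0) -> List[Layer]:
    """Build one weighted directed graph per attention layer.

    Nodes are patch indices shared by all layers. An edge (i, j) with
    weight w means patch i attends to patch j with weight w in that layer.
    Self-loops and weights not above `threshold` are dropped (an assumption).
    """
    multiplex = []
    for attn in attention_layers:
        edges: Layer = {}
        for i, row in enumerate(attn):
            for j, w in enumerate(row):
                if i != j and w > threshold:
                    edges[(i, j)] = w
        multiplex.append(edges)
    return multiplex


def patch_scores(multiplex: List[Layer], n_patches: int) -> List[float]:
    """Score each patch by its in-strength summed over all layers.

    This scoring rule is an illustrative assumption, not the paper's measure.
    """
    scores = [0.0] * n_patches
    for edges in multiplex:
        for (_, target), w in edges.items():
            scores[target] += w
    return scores


def top_k_mask(scores: List[float], k: int) -> List[bool]:
    """Return a keep-mask that is True for the k highest-scoring patches."""
    ranked = sorted(range(len(scores)), key=lambda p: scores[p], reverse=True)
    keep = set(ranked[:k])
    return [p in keep for p in range(len(scores))]


# Toy input: 2 attention layers over 3 patches (made-up values).
layer_1 = [[0.2, 0.7, 0.1], [0.1, 0.8, 0.1], [0.3, 0.6, 0.1]]
layer_2 = [[0.5, 0.4, 0.1], [0.2, 0.6, 0.2], [0.1, 0.8, 0.1]]

multiplex = build_multiplex([layer_1, layer_2])
scores = patch_scores(multiplex, n_patches=3)
mask = top_k_mask(scores, k=1)
print(mask)
```

In a mask perturbation setting, a keep-mask of this kind could be used to hide or retain patches before re-running the model and observing how the prediction changes. How MUTEX actually scores patches and applies perturbations is described in the paper, not here.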

External IDs: dblp:journals/nca/MarchettiTUV25