Abstract: Understanding the internal mechanisms of deep neural networks remains a central challenge in machine learning. In computer vision, one promising yet only preliminarily explored approach is feature inversion, which reconstructs images from a model's intermediate representations using trained inverse networks. In this study, we revisit feature inversion via inverse networks and introduce a novel, modular variant that is computationally more efficient to apply and yields semantically more coherent image reconstructions. We apply our method to large-scale transformer-based vision models, specifically the Detection Transformer (DETR), Vision Transformer (ViT), Swin Transformer, and Data-Efficient Image Transformer (DeiT), analyzing the resulting reconstructions across network depth. Our results reveal shared properties and systematic differences in how these architectures process visual information, including their handling of contextual shape, fine-grained image detail, inter-layer representational similarity, and robustness to color perturbations. These findings contribute to a deeper understanding of transformer-based vision models and demonstrate the utility of modular feature inversion as an interpretability tool.
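To make the underlying idea concrete, the following is a minimal sketch of feature inversion with a trained inverse network, using PyTorch and torchvision's ViT-B/16. The decoder architecture, the `features_at_block` helper, the block index, and the plain MSE objective are illustrative assumptions for a single layer, not the paper's actual modular method.

```python
# Minimal sketch of feature inversion via an inverse network.
# Assumptions: the small deconvolutional decoder, the chosen block index,
# and the MSE loss are illustrative, not the submission's architecture.
import torch
import torch.nn as nn
from torchvision.models import vit_b_16, ViT_B_16_Weights

# Frozen forward model: ViT-B/16 pretrained on ImageNet.
vit = vit_b_16(weights=ViT_B_16_Weights.IMAGENET1K_V1).eval()
for p in vit.parameters():
    p.requires_grad_(False)

def features_at_block(x: torch.Tensor, block_idx: int) -> torch.Tensor:
    """Return patch-token features after encoder block `block_idx` as a
    (B, 768, 14, 14) map, dropping the class token. Relies on torchvision's
    internal `_process_input` patch-embedding helper."""
    tokens = vit._process_input(x)                        # (B, 196, 768)
    cls = vit.class_token.expand(x.shape[0], -1, -1)
    tokens = torch.cat([cls, tokens], dim=1) + vit.encoder.pos_embedding
    for blk in vit.encoder.layers[: block_idx + 1]:
        tokens = blk(tokens)
    patch = tokens[:, 1:, :]                              # drop CLS token
    return patch.transpose(1, 2).reshape(-1, 768, 14, 14)

# Inverse network: upsample the 14x14 feature map back to a 224x224 image.
inverse_net = nn.Sequential(
    nn.ConvTranspose2d(768, 256, kernel_size=4, stride=2, padding=1),  # 28
    nn.ReLU(inplace=True),
    nn.ConvTranspose2d(256, 128, kernel_size=4, stride=2, padding=1),  # 56
    nn.ReLU(inplace=True),
    nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1),   # 112
    nn.ReLU(inplace=True),
    nn.ConvTranspose2d(64, 3, kernel_size=4, stride=2, padding=1),     # 224
)

# One illustrative training step on a stand-in batch of images.
optimizer = torch.optim.Adam(inverse_net.parameters(), lr=1e-4)
images = torch.rand(4, 3, 224, 224)
with torch.no_grad():
    feats = features_at_block(images, block_idx=5)
recon = inverse_net(feats)
loss = nn.functional.mse_loss(recon, images)
loss.backward()
optimizer.step()
```

Once such a decoder is trained on a large image set, feeding it the features of unseen images yields the reconstructions that are then compared across blocks and architectures.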
Submission Type: Long submission (more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=qcgWlzKiCl
Changes Since Last Submission: This manuscript is resubmitted as a substantially revised and extended version of the previous submission.
The revision expands the empirical scope by incorporating two additional architectures, DeiT and Swin, enabling a more comprehensive cross-model analysis.
We further introduce a dedicated evaluation of the computational efficiency of the proposed method.
In addition, the paper now includes a detailed quantitative assessment of reconstruction quality using several pixel-level and perceptual metrics, complementing the qualitative analysis.
We also conduct additional experiments to better disentangle the origins of observed differences across models, including an analysis of prototypical representations.
Overall, the manuscript has been significantly strengthened in scope and analytical depth.
Assigned Action Editor: ~Ming-Hsuan_Yang1
Submission Number: 8257