Peeling Back the Layers: Interpreting the Storytelling of ViT

Jingjie Zeng; Zhihao Yang; Qi Yang; Liang Yang; Hongfei Lin

Published: 20 Jul 2024, Last Modified: 21 Jul 2024, MM2024 Oral, CC BY 4.0

Abstract: By integrating various modules with the Vision Transformer (ViT), we facilitate an interpretation of image processing across each layer and attention head. This method allows us to explore the connections both within and across layers, enabling an analysis of how images are processed at different layers. We analyze the contributions of each layer and attention head, shedding light on the interactions and functionalities within the model's layers. This exploration not only highlights the visual cues between layers but also examines their capacity to navigate the transition from abstract concepts to tangible objects. It reveals the model's mechanism for building an understanding of images and provides a strategy for adjusting attention heads between layers, thus enabling targeted pruning and performance enhancement for specific tasks. Our research indicates that a scalable understanding of transformer models is within reach, offering ways to refine and enhance such models.

Primary Subject Area: [Content] Media Interpretation

Secondary Subject Area: [Content] Vision and Language

Relevance To Conference: This work contributes to multimedia and multimodal processing by improving the understanding and interpretation of images through the integration of various modules with the Vision Transformer (ViT). By enabling a detailed examination of image processing across each layer and attention head, the approach clarifies how visual information flows through the model and what role each component plays in interpreting complex visual inputs. This examination of how images are processed and represented at different layers supports a more sophisticated analysis of visual data, which is crucial for multimedia applications where integrating and interpreting various data types is key.

Supplementary Material: zip

Submission Number: 4397
