Global Patch-wise Attention is Masterful Facilitator for Masked Image Modeling

Published: 20 Jul 2024, Last Modified: 21 Jul 2024, Venue: MM2024 Poster, Readers: Everyone, License: CC BY 4.0
Abstract: Masked image modeling (MIM), a self-supervised learning paradigm in computer vision, has gained widespread attention among researchers. MIM trains a model to predict the masked patches of an image. Because image semantics are sparse, it is imperative to devise a masking strategy that steers the model towards reconstructing high-semantic regions. However, conventional masking strategies often miss these regions or produce masks that are poorly aligned with the image semantics. To solve this, we propose the Global Patch-wise Attention (GPA) framework, a transferable and efficient framework for MIM pre-training. We observe that the attention between patches can serve as a metric for identifying high-semantic regions, which can guide the model to learn more effective representations. We therefore first define the global patch-wise attention via vision transformer blocks. We then design a soft-to-hard mask generation scheme that guides the model to gradually focus on the high-semantic regions identified by GPA (GPA as a teacher). Finally, we design an extra task to predict GPA (GPA as a feature). Experiments conducted under various settings demonstrate that the proposed GPA framework enables MIM to learn better representations, which benefit the model across a wide range of downstream tasks. Furthermore, the GPA framework can be easily and effectively transferred to various MIM architectures.
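The abstract does not specify how GPA is computed or how the soft-to-hard schedule is defined, so the following is only a minimal illustrative sketch under stated assumptions, not the authors' method. It assumes that a patch's score is the average attention it receives from all patches, and that "soft-to-hard" means blending random noise with that score according to training progress. The names `patch_scores`, `soft_to_hard_mask`, and `progress` are hypothetical.

```python
import random


def patch_scores(attn):
    """Assumed scoring rule: mean attention each patch receives (column mean)."""
    n = len(attn)
    return [sum(attn[i][j] for i in range(n)) / n for j in range(n)]


def soft_to_hard_mask(scores, mask_ratio, progress, rng):
    """Return sorted indices of patches to mask.

    progress=0.0 gives purely random masking (soft); progress=1.0 masks
    exactly the top-scoring patches (hard). The linear blend is an
    assumption made for illustration only.
    """
    lo, hi = min(scores), max(scores)
    span = (hi - lo) or 1.0  # avoid division by zero when all scores are equal
    normed = [(s - lo) / span for s in scores]
    mixed = [progress * s + (1.0 - progress) * rng.random() for s in normed]
    k = round(mask_ratio * len(scores))
    ranked = sorted(range(len(scores)), key=lambda i: mixed[i], reverse=True)
    return sorted(ranked[:k])


# Toy 4-patch attention matrix (each row sums to 1).
# Patch 2 receives the most attention, followed by patch 1.
attn = [
    [0.1, 0.2, 0.6, 0.1],
    [0.1, 0.1, 0.7, 0.1],
    [0.2, 0.2, 0.4, 0.2],
    [0.1, 0.3, 0.5, 0.1],
]
scores = patch_scores(attn)
early = soft_to_hard_mask(scores, mask_ratio=0.5, progress=0.0, rng=random.Random(0))
late = soft_to_hard_mask(scores, mask_ratio=0.5, progress=1.0, rng=random.Random(0))
```

In this toy setup, `early` is a random pair of patches, while `late` contains the two patches with the highest assumed scores. A real pipeline would take the attention matrix from vision transformer blocks, as the abstract describes.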
Primary Subject Area: [Content] Media Interpretation
Relevance To Conference: In this work, we propose the Global Patch-wise Attention (GPA) framework, a transferable and efficient paradigm for learning visual representations. The GPA framework allows us to attain superior visual representations with significantly fewer training iterations (-62.5% in our experiments) and better performance (+18.2% linear probing accuracy compared to state-of-the-art masking strategies). Learning visual representations is a pivotal issue in multimedia. By making visual representation learning more efficient and effective, GPA enables multimedia systems to interpret complex visual information more rapidly and accurately, which is essential for advanced multimedia interpretation tasks. The framework's ability to significantly reduce training time while improving representation quality aligns with the multimedia field's demand for fast, accurate, and robust systems capable of processing and integrating diverse modalities. This is particularly relevant for tasks that require nuanced understanding and generation of multimedia content, where better visual representations can lead to improved inference, knowledge extraction, and content creation. Thus, GPA not only addresses a core issue in multimedia processing but also offers a scalable solution adaptable to various multimedia applications, promoting advances in how we interpret, interact with, and generate multimedia content.
Supplementary Material: zip
Submission Number: 3376