PiLaMIM: Toward Richer Visual Representations by Integrating Pixel and Latent Masked Image Modeling

Published: 13 Oct 2024 · Last Modified: 02 Dec 2024 · NeurIPS 2024 Workshop SSL · CC BY 4.0
Keywords: Self-supervised Learning, Representation Learning, Masked Image Modeling
TL;DR: We propose PiLaMIM, a unified framework combining Pixel MIM and Latent MIM to leverage their complementary strengths, capturing both high-level and low-level visual features for richer visual representations.
Abstract: In Masked Image Modeling (MIM), two primary approaches exist: Pixel MIM and Latent MIM, which use raw pixels and latent representations as their reconstruction targets, respectively. Pixel MIM tends to capture low-level visual details such as color and texture, while Latent MIM focuses on the high-level semantics of an object. However, the distinct strengths of each method can lead to suboptimal performance on tasks that rely on a particular level of visual features. To address this limitation, we propose PiLaMIM, a unified framework that combines Pixel MIM and Latent MIM to integrate their complementary strengths. Our method uses a single encoder along with two distinct decoders: one for predicting pixel values and another for latent representations, ensuring that both high-level and low-level visual features are captured. We further integrate the $\texttt{[CLS]}$ token into the reconstruction process to aggregate global context, enabling the model to capture more semantic information. Extensive experiments demonstrate that PiLaMIM outperforms key baselines such as MAE, I-JEPA, and BootMAE in most cases, proving its effectiveness in extracting richer visual representations.
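The abstract describes a single encoder feeding two separate decoders, one regressing raw pixels and one predicting latent targets, with the `[CLS]` token included in the reconstruction path. The sketch below illustrates that dual-decoder layout only; all module names, dimensions, positional-embedding handling, and the way latent targets are produced are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn


class PiLaMIMSketch(nn.Module):
    """Minimal sketch of the dual-decoder idea from the abstract.

    Assumptions (not from the paper): ViT-Base-like dimensions, 16x16 RGB
    patches, shallow decoders, and omitted positional embeddings / masking
    logic. Latent targets (e.g., from a momentum encoder) are not shown.
    """

    def __init__(self, dim=768, patch_dim=16 * 16 * 3, enc_depth=12,
                 dec_depth=4, n_heads=12):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
        # Single shared encoder over [CLS] + visible patch tokens.
        self.encoder = nn.TransformerEncoder(layer, enc_depth)
        # Two distinct decoders: one for pixels, one for latent representations.
        self.pixel_decoder = nn.TransformerEncoder(layer, dec_depth)
        self.latent_decoder = nn.TransformerEncoder(layer, dec_depth)
        self.to_pixels = nn.Linear(dim, patch_dim)   # reconstruct raw patch pixels
        self.to_latents = nn.Linear(dim, dim)        # predict target latent features
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))

    def forward(self, visible_tokens, n_masked):
        # visible_tokens: (B, N_visible, dim) embeddings of unmasked patches.
        B = visible_tokens.size(0)
        cls = self.cls_token.expand(B, -1, -1)
        # Encode the [CLS] token together with the visible patches so global
        # context is available during reconstruction.
        enc = self.encoder(torch.cat([cls, visible_tokens], dim=1))
        # Append mask tokens for the masked positions, then decode twice.
        mask = self.mask_token.expand(B, n_masked, -1)
        dec_in = torch.cat([enc, mask], dim=1)
        pixel_pred = self.to_pixels(self.pixel_decoder(dec_in)[:, -n_masked:])
        latent_pred = self.to_latents(self.latent_decoder(dec_in)[:, -n_masked:])
        return pixel_pred, latent_pred
```

In this reading, the pixel branch would be trained against raw patch values (as in MAE) and the latent branch against target representations of the masked patches, with the two losses combined; the exact targets and loss weighting are not specified in the abstract.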
Submission Number: 75