Abstract: We present a novel masked image modeling (MIM) approach, called context autoencoder (CAE), for self-supervised representation learning. We randomly partition the image into two sets of patches: visible patches and masked patches. The architecture consists of: (i) an encoder that takes the visible patches as input and outputs their latent representations, (ii) a latent context regressor that predicts the masked patch representations from the visible patch representations, which are not updated within this regressor, (iii) a decoder that takes the estimated masked patch representations as input and makes predictions for the masked patches, and (iv) an alignment module that aligns the estimated masked patch representations with the masked patch representations computed from the encoder.
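For concreteness, the following is a minimal PyTorch sketch of this four-part flow. The module choices (a Transformer encoder stand-in, a single cross-attention layer as the latent context regressor, a linear prediction head), the stop-gradient on the alignment target, the dimensions, and all names are our illustrative assumptions rather than the paper's exact configuration; positional embeddings and patchification are omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CAESketch(nn.Module):
    """Minimal sketch of the CAE flow: encoder -> latent context regressor
    -> decoder, plus the alignment constraint. Illustrative only."""

    def __init__(self, dim=192, heads=4, patch_pixels=16 * 16 * 3):
        super().__init__()
        # (i) Encoder: encodes visible patches (toy Transformer stand-in).
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True),
            num_layers=2)
        # (ii) Latent context regressor, approximated by one cross-attention layer.
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # (iii) Decoder: predicts masked-patch content (here, raw pixels).
        self.decoder = nn.Linear(dim, patch_pixels)
        # Learnable query used in place of each masked patch.
        self.mask_query = nn.Parameter(torch.zeros(1, 1, dim))

    def forward(self, vis_embed, masked_embed):
        # (i) Encode only the visible patches.
        z_vis = self.encoder(vis_embed)
        # (ii) Mask queries attend to the visible latents; the visible
        #      representations serve only as keys/values, so this module
        #      does not update them.
        q = self.mask_query.expand(masked_embed.size(0), masked_embed.size(1), -1)
        z_pred, _ = self.cross_attn(q, z_vis, z_vis)
        # (iv) Alignment: pull the estimated masked-patch latents toward the
        #      encoder's own latents for the masked patches (stop-gradient on
        #      the target branch is our assumption, consistent with the text).
        with torch.no_grad():
            z_target = self.encoder(masked_embed)
        align_loss = F.mse_loss(z_pred, z_target)
        # (iii) Decode the estimated latents into masked-patch predictions.
        pred = self.decoder(z_pred)
        return pred, align_loss
```

In this sketch, the prediction `pred` would be trained against the masked-patch targets while `align_loss` enforces that the regressed latents live in the encoder's representation space, which is the separation of roles the abstract describes.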
In comparison to previous MIM methods that couple the encoding and decoding roles, e.g., using a single module in BEiT, our approach attempts to separate the encoding role (content understanding) from the task decoding role (making predictions for masked patches) using different modules, improving the content understanding capability. In addition, our approach makes predictions from the visible patches to the masked patches in the latent representation space, which is expected to take on semantics. We demonstrate the effectiveness of our CAE through superior transfer performance in downstream tasks: semantic segmentation and object detection.