['3c3', '< Abstract: Establishing correspondence between images or scenes is a significant challenge in computer vision, especially given occlusions, viewpoint changes, and varying object appearances. In this paper, we present Siamese Masked Autoencoders (SiamMAE), a simple extension of Masked Autoencoders (MAE) for learning visual correspondence from videos. SiamMAE operates on pairs of randomly sampled video frames and asymmetrically masks them. These frames are processed independently by an encoder network, and a decoder composed of a sequence of cross-attention layers is tasked with predicting the missing patches in the future frame. By masking a large fraction (95%) of patches in the future frame while leaving the past frame unchanged, SiamMAE encourages the network to focus on object motion and learn object-centric representations. Despite its conceptual simplicity, features learned via SiamMAE outperform state-of-the-art self-supervised methods on video object segmentation, pose keypoint propagation, and semantic part propagation tasks. SiamMAE achieves competitive results without relying on data augmentation, handcrafted tracking-based pretext tasks, or other techniques to prevent representational collapse.', '---', '> Abstract: Accurate visual correspondence across dynamic scenes, robust to occlusions and viewpoint changes, remains a foundational challenge in computer vision. We introduce Siamese Masked Autoencoders (SiamMAE), a novel self-supervised framework extending Masked Autoencoders (MAE) to effectively learn visual correspondence from videos. SiamMAE leverages an asymmetric masking strategy, processing pairs of randomly sampled video frames: a fully visible past frame and a heavily masked (95%) future frame. A siamese encoder processes these frames independently, while a cross-attention decoder reconstructs the missing patches in the future frame. 
This asymmetric approach compels the network to explicitly model object motion and learn robust object-centric representations, crucial for dense correspondence. Despite its conceptual simplicity, SiamMAE achieves state-of-the-art performance across video object segmentation, pose keypoint propagation, and semantic part propagation tasks, significantly outperforming prior self-supervised methods. Crucially, SiamMAE attains these competitive results without recourse to extensive data augmentation, handcrafted tracking-based pretext tasks, or collapse-prevention mechanisms.', '6,11c6', '< "The distinction between the past, present, and future is only a stubbornly persistent illusion."', '< -Albert Einstein Time is a special dimension in the context of visual learning, providing the structure within which sequential events are perceived, cause-effect relationships are learned, objects are tracked as they move through space, and future events are predicted. Central to all of these capabilities is the ability to establish visual correspondence over time. Our visual system is adept at establishing correspondence between scenes despite occlusions, viewpoint changes, and object transformations. This capability is unsupervised and critical to human visual perception, and it remains a significant challenge in computer vision. Equipping machines with such a capability enables a wide range of applications such as object segmentation and tracking in videos, depth and optical flow estimation, and 3D reconstruction [1][2][3][4][5][6][7][8].', '< A powerful self-supervised learning paradigm is predictive learning, i.e., predicting any unobserved or hidden part of the signal from any observed or unhidden part of the signal [9]. Notably, this form of predictive learning has been used for learning correspondences [10][11][12] by predicting the colors of a grayscale future frame by observing a (colorful) past reference frame. 
However, the performance of these methods has trailed behind contrastive self-supervised learning [13] approaches. State-of-the-art methods [14][15][16][17] for learning correspondence primarily employ some form of contrastive learning [13]. Intuitively, contrastive learning-based approaches are well-suited for the task of learning correspondence, as they utilize extensive data augmentation to learn features invariant to changes in pose, lighting, viewpoint, and other factors. However, a major criticism of contrastive approaches is their reliance on careful selection of augmentations to learn useful invariances [18], along with a suite of additional components [19,20,17,21] to prevent representational collapse. Figure caption: During pre-training, we randomly sample a pair of video frames and randomly mask a huge fraction (95%) of patches of the future frame while leaving the past frame unchanged. The two frames are processed independently by a siamese encoder parametrized by a ViT [31]. The decoder consists of a sequence of cross-attention layers and predicts missing patches in the future frame. Videos available at this project page.', '< Recently, predictive learning methods like masked language modeling [22,23] and masked visual modeling (MVM) [24][25][26] have demonstrated promising results in natural language processing and computer vision domains. MVM methods like Masked Autoencoders (MAE) learn good visual representations without relying on data augmentation by learning to reconstruct the missing patches from randomly masked input image patches. Extending MVM methods from images to videos for learning correspondence is, however, nontrivial for two reasons. First, features learned by MAEs are specialized for the pixel reconstruction task: they show excellent downstream performance after fine-tuning but do not transfer well in zero-shot settings. Second, existing extensions of MAEs in the video domain [27,28] also symmetrically mask a huge fraction of patches across all frames. 
Unlike the spatial dimensions of images, which are (approximately) isotropic [29], the temporal dimension is special [30], and not all spatio-temporal orientations are equally likely. Hence, symmetrically treating spatial and temporal information might be sub-optimal. Indeed, MAEs trained on videos do not outperform MAEs trained on ImageNet on video instance tracking benchmarks (Table 1).', '< To address these limitations, we present Siamese Masked Autoencoders (SiamMAE): a simple extension of MAEs for learning visual correspondence from videos. In our approach, two frames are randomly selected from a video clip, with the future frame having a significant portion (95%) of its patches randomly masked, while the past frame is left intact. These frames are processed independently by an encoder network, and a decoder composed of a sequence of cross-attention layers is tasked with predicting the missing patches in the future frame. Our asymmetric masking approach encourages the network to model object motion, or in other words, to understand what went where [32]. Simple extensions of MAEs to frames with symmetric masking waste model capacity on modeling low-level image details. However, by providing the entire past frame as input, our network is primarily focused on propagating the patches from the past frame to their corresponding locations in the future frame. The cross-attention layers in our decoder serve a function akin to the affinity matrix often employed in self-supervised correspondence learning approaches. Empirically, we find that the combination of asymmetric masking, a siamese encoder, and our decoder can effectively learn features suitable for tasks requiring fine-grained and object-level correspondence.', '< Despite the conceptual simplicity of our method, it outperforms state-of-the-art self-supervised methods on video object segmentation, pose keypoint propagation, and semantic part propagation. 
Moreover, our ViT-S/16 models significantly outperform larger models trained on ImageNet (+8.5% J&F_m for ViT-B) and Kinetics-400 (+7.4% J&F_m for ViT-L) via MVM in video object segmentation tasks. We also observe significant performance gains across all tasks with models trained with smaller patch sizes (ViT-S/8, ViT-B/8). SiamMAE achieves competitive results without relying on data augmentation [16,17], handcrafted tracking-based pretext tasks [15,14], multi-crop training [17], or additional techniques to prevent representational collapse [10,11,17,16] or enhance performance [11]. We believe that our detailed analysis, straightforward approach, and state-of-the-art performance can serve as a robust baseline for self-supervised correspondence learning.', '---', '> Establishing robust visual correspondence across dynamic scenes, despite occlusions, viewpoint changes, and varying object appearances, is a fundamental and long-standing challenge in computer vision. This capability is crucial for understanding sequential events, learning cause-effect relationships, tracking objects, and predicting future actions, mirroring a critical, unsupervised aspect of human visual perception. Equipping machines with this ability unlocks a wide array of applications, including video object segmentation, depth and optical flow estimation, 3D reconstruction, and visual tracking [1][2][3][4][5][6][7][8].', '12a8,15', '> A prominent self-supervised learning paradigm is predictive learning, which involves inferring unobserved parts of a signal from observed parts [9]. While predictive methods have been explored for correspondence learning, such as predicting the colors of a grayscale future frame from a colorful past reference frame [10][11][12], their performance has historically lagged behind contrastive self-supervised learning approaches [13]. 
State-of-the-art methods [14][15][16][17] for visual correspondence primarily rely on contrastive learning, leveraging extensive data augmentation to learn features invariant to various transformations. However, these methods are often criticized for their dependence on carefully chosen augmentations [18] and a complex suite of additional components [19,20,17,21] to prevent representational collapse.', '> ', '> More recently, predictive learning methods, exemplified by masked language modeling [22,23] and masked visual modeling (MVM) [24][25][26], have achieved remarkable success in NLP and computer vision. Masked Autoencoders (MAE) [24] learn powerful visual representations by reconstructing masked image patches, notably without relying on data augmentation. However, directly extending MVM from images to videos for correspondence learning presents two key challenges. First, MAE features are optimized for pixel reconstruction, excelling in fine-tuning but showing limited zero-shot transferability. Second, existing video MAE extensions [27,28] symmetrically mask a large fraction of patches across all frames. Given that the temporal dimension is inherently distinct from spatial dimensions [29,30], treating spatio-temporal information symmetrically can be sub-optimal. Indeed, video MAEs often fail to outperform ImageNet-pretrained MAEs on video instance tracking benchmarks (Table 1), indicating a missed opportunity for temporal learning.', '> ', '> To overcome these limitations, we propose Siamese Masked Autoencoders (SiamMAE), a novel and conceptually simple extension of MAEs designed for learning robust visual correspondence from videos. SiamMAE employs an asymmetric masking strategy: a full, unmasked past frame and a heavily masked (95%) future frame are randomly sampled from a video clip. These frames are processed independently by a siamese encoder, typically a Vision Transformer (ViT) [31]. 
A decoder, composed of cross-attention layers, is then tasked with reconstructing the missing patches in the future frame. This asymmetric design compels the network to explicitly model object motion—understanding "what went where" [32]—and to learn object-centric representations, rather than merely reconstructing low-level image details. The cross-attention mechanism in our decoder inherently functions like the affinity matrices commonly used in self-supervised correspondence learning. Empirically, this combination of asymmetric masking, a siamese encoder, and a cross-attention decoder proves highly effective for learning features critical for fine-grained and object-level correspondence.', '> ', "> Despite its simplicity, SiamMAE achieves state-of-the-art performance on video object segmentation, pose keypoint propagation, and semantic part propagation tasks, significantly outperforming prior self-supervised methods. Our ViT-S/16 models, comparable in size to ResNet-50, surpass larger MVM models (e.g., ViT-B on ImageNet, ViT-L on Kinetics-400) by substantial margins (e.g., +8.5% J&Fm for ViT-B in VOS). We also demonstrate significant gains with smaller patch sizes (ViT-S/8, ViT-B/8) across all tasks. Crucially, SiamMAE achieves these competitive results without relying on extensive data augmentation [16,17], handcrafted tracking-based pretext tasks [15,14], multi-crop training [17], or additional mechanisms to prevent representational collapse [10,11,17,16]. We believe SiamMAE's straightforward approach, detailed analysis, and superior performance establish a robust new baseline for self-supervised correspondence learning.", '> ', '14c17', '< Temporal correspondence. The visual world is smooth and continuous [33,34], providing a rich source of information for biological and machine vision systems. 
In biological vision, infants learn about objects and their properties by establishing temporal correspondence, taking advantage of the inherent smoothness in the visual input [35]. Similarly, in machine vision, learning fine-grained correspondence from video frames is an important problem that has been studied for decades in the form of optical flow and motion estimation [36-42, 5, 6, 43, 44, 7]. However, despite their impressive performance, these methods rely on costly human-annotated or synthetic data with pixel-level ground truth annotations [45,46]. A more semantically meaningful task involves determining object-level correspondence, i.e., visual object tracking [47][48][49][50][51][52]. One popular approach is tracking-by-matching, which uses deep features learned via supervised [53,54] or self-supervised learning [10-12, 14-16] on videos. State-of-the-art methods [14][15][16][17] for self-supervised feature learning for correspondence primarily employ some form of contrastive learning [13]. Predictive learning has also been used for learning correspondences [10,11] by predicting the target colors for a grayscale input frame by observing a colorful reference frame. However, the performance of these methods has trailed behind contrastive approaches. In this work, we show that predictive-learning-based methods can be used for learning fine-grained and object-level correspondence.  19,17,20] or only similarity [77,21]. Furthermore, contrastive learning has also been successfully applied to videos [78-82, 82, 83]. However, a limitation of contrastive approaches is their dependence on careful selection of augmentations to learn useful invariances [18], as well as the need for a suite of additional components [19,20,17,21] to prevent representational collapse.', '---', '> Temporal correspondence. The inherent smoothness and continuity of the visual world [33,34] provide a rich source of information for both biological and machine vision systems. 
In biological vision, infants leverage this temporal coherence to learn about objects and their properties by establishing correspondence over time [35]. Similarly, in machine vision, learning fine-grained correspondence from video frames is a long-standing and crucial problem, extensively studied through optical flow and motion estimation techniques [36-42, 5, 6, 43, 44, 7]. While these methods achieve impressive performance, they typically demand costly human-annotated or synthetic data with pixel-level ground truth [45,46]. A more semantically meaningful task involves determining object-level correspondence, often addressed through visual object tracking [47][48][49][50][51][52]. Tracking-by-matching approaches, a popular subset, utilize deep features learned via either supervised [53,54] or self-supervised methods [10-12, 14-16] on videos. State-of-the-art self-supervised feature learning for correspondence predominantly employs various forms of contrastive learning [13]. Predictive learning has also been explored for correspondence [10,11], for instance, by predicting target colors for grayscale future frames using colorful reference frames. However, the performance of these predictive methods has historically trailed behind contrastive approaches. In this work, we demonstrate that predictive learning, specifically through an appropriately designed masked autoencoder, can indeed be highly effective for learning both fine-grained and object-level correspondence.', '16,17c19', '< Section: Self', '< Masked autoencoders. Masked autoencoders are a type of denoising autoencoder [84] that learn representations by reconstructing the original input from corrupted (i.e., masked) inputs. The introduction of masked language modeling in BERT [85] has had a transformative impact on the natural language processing field, particularly when scaled to large datasets and model sizes [86,87]. 
Masked autoencoders have also been successfully adapted to learn representations from images [24][25][26] and videos [28,27]. Our work studies a simple extension of MAEs [24] to videos. However, unlike prior methods [28,27] that symmetrically mask all frames, we propose an asymmetric masking scheme, leaving the past frame unchanged and masking a higher percentage of the future frame.  19,17,20], as their design allows an easy way to learn invariant visual representations from data. Inspired by the success of masked autoencoders, researchers have also explored combining contrastive learning with siamese networks and masked visual modeling [91,92]. However, we are not aware of any previous studies that have investigated siamese masked autoencoders using asymmetric masking for representation learning from videos.', '---', '> Self-supervised learning for videos. Beyond correspondence, self-supervised learning has seen significant advancements in video understanding. Early methods focused on pretext tasks such as predicting future frames [59-64], learning temporal order [65-69], or exploiting ego-motion [70,71]. More recent approaches often leverage contrastive learning, aiming to maximize agreement between different views of the same video clip while minimizing agreement with other clips [78-83]. These methods often incorporate sophisticated architectures, multi-crop strategies, and careful selection of augmentations to achieve robust representations and prevent representational collapse [18,19,20,17,21].', '18a21,22', '> Masked autoencoders. Masked autoencoders are a powerful class of denoising autoencoders [84] that learn representations by reconstructing original inputs from corrupted (i.e., masked) versions. The paradigm of masked language modeling, pioneered by BERT [85], has revolutionized natural language processing, particularly with large-scale models [86,87]. 
Masked autoencoders have since been successfully adapted to learn representations from images [24][25][26] and videos [28,27]. Our work builds upon the success of MAEs [24] and extends them to video data. Crucially, unlike prior video MAE methods [28,27] that symmetrically mask all frames, we introduce an asymmetric masking scheme. This strategy involves leaving the past frame entirely unmasked while heavily masking the future frame, a design choice we show to be critical for learning temporal correspondence effectively. While some recent works have explored combining contrastive learning with siamese networks and masked visual modeling [91,92], we are the first to investigate siamese masked autoencoders with asymmetric masking specifically for robust correspondence learning from videos.', '> ', '238d241', '< ']
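
Both sides of the diff above describe the same asymmetric masking step: the past frame stays fully visible while 95% of the future frame's patches are hidden. A minimal NumPy sketch of that step follows, to make the asymmetry concrete. The function names, the seed handling, and the flat patch indexing are illustrative assumptions and are not taken from the authors' code. Only the 0% / 95% ratios come from the text.

```python
import numpy as np


def random_patch_mask(num_patches, mask_ratio, rng):
    """Return a boolean array of length num_patches; True marks a hidden patch.

    Hypothetical helper: patches are chosen uniformly at random without
    replacement, as in standard MAE-style masking.
    """
    num_masked = int(round(num_patches * mask_ratio))
    mask = np.zeros(num_patches, dtype=bool)
    masked_idx = rng.choice(num_patches, size=num_masked, replace=False)
    mask[masked_idx] = True
    return mask


def asymmetric_masks(num_patches, future_mask_ratio=0.95, seed=0):
    """Build masks for a (past, future) frame pair.

    The past frame is left unmasked (ratio 0.0); the future frame is masked
    at future_mask_ratio (0.95 in the text).
    """
    rng = np.random.default_rng(seed)
    past_mask = random_patch_mask(num_patches, 0.0, rng)
    future_mask = random_patch_mask(num_patches, future_mask_ratio, rng)
    return past_mask, future_mask


if __name__ == "__main__":
    # 100 patches is an arbitrary illustrative count, not a value from the paper.
    past, future = asymmetric_masks(num_patches=100)
    print("visible past patches:", int((~past).sum()))
    print("visible future patches:", int((~future).sum()))
```

In this sketch the encoder would see every past-frame patch but only the few future-frame patches where the mask is False. The decoder's cross-attention then has to fill in the hidden future patches from the past frame.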
