Title: Siamese Masked Autoencoders

Abstract: Establishing correspondence between images or scenes is a significant challenge in computer vision, especially given occlusions, viewpoint changes, and varying object appearances. In this paper, we present Siamese Masked Autoencoders (SiamMAE), a simple extension of Masked Autoencoders (MAE) for learning visual correspondence from videos. SiamMAE operates on pairs of randomly sampled video frames and asymmetrically masks them. These frames are processed independently by an encoder network, and a decoder composed of a sequence of cross-attention layers is tasked with predicting the missing patches in the future frame. By masking a large fraction (95%) of patches in the future frame while leaving the past frame unchanged, SiamMAE encourages the network to focus on object motion and learn object-centric representations. Despite its conceptual simplicity, features learned via SiamMAE outperform state-of-the-art self-supervised methods on video object segmentation, pose keypoint propagation, and semantic part propagation tasks. SiamMAE achieves competitive results without relying on data augmentation, handcrafted tracking-based pretext tasks, or other techniques to prevent representational collapse.

Section: Introduction
"The distinction between the past, present, and future is only a stubbornly persistent illusion."
-Albert Einstein
Time is a special dimension in the context of visual learning, providing the structure within which sequential events are perceived, cause-effect relationships are learned, objects are tracked as they move through space, and future events are predicted. Central to all of these capabilities is the ability to establish visual correspondence over time. Our visual system is adept at establishing correspondence between scenes despite occlusions, viewpoint changes, and object transformations. This capability is unsupervised, critical to human visual perception, and remains a significant challenge in computer vision. Equipping machines with such a capability enables a wide range of applications such as object segmentation and tracking in videos, depth and optical flow estimation, and 3D reconstruction [1][2][3][4][5][6][7][8].
A powerful self-supervised learning paradigm is predictive learning, i.e., predicting any unobserved or hidden part of the signal from any observed or unhidden part of the signal [9]. Notably, this form of predictive learning has been used for learning correspondences [10][11][12] by predicting the colors of a grayscale future frame from a (colorful) past reference frame. However, the performance of these methods has trailed behind contrastive self-supervised learning [13] approaches. State-of-the-art methods [14][15][16][17] for learning correspondence primarily employ some form of contrastive learning [13]. Intuitively, contrastive learning-based approaches are well-suited for the task of learning correspondence, as they utilize extensive data augmentation to learn features invariant to changes in pose, lighting, viewpoint, and other factors. However, a major criticism of contrastive approaches is their reliance on careful selection of augmentations to learn useful invariances [18], along with a suite of additional components [19,20,17,21] to prevent representational collapse.
Figure 1: During pre-training, we randomly sample a pair of video frames and randomly mask a huge fraction (95%) of patches of the future frame while leaving the past frame unchanged. The two frames are processed independently by a siamese encoder parametrized by a ViT [31]. The decoder consists of a sequence of cross-attention layers and predicts missing patches in the future frame. Videos are available at the project page.
Recently, predictive learning methods like masked language modelling [22,23] and masked visual modeling (MVM) [24][25][26] have demonstrated promising results in natural language processing and computer vision domains. MVM methods like Masked Autoencoders (MAE) learn good visual representations without relying on data augmentation by learning to reconstruct the missing patches from randomly masked input image patches. Extending MVM methods from images to videos for learning correspondence is, however, nontrivial for two reasons. First, features learned by MAEs are specialized for the pixel reconstruction task: they show excellent downstream performance after fine-tuning but do not transfer well in zero-shot settings. Second, existing extensions of MAEs in the video domain [27,28] also symmetrically mask a huge fraction of patches across all frames. Unlike images, which are (approximately) isotropic [29], the temporal dimension is special [30], and not all spatio-temporal orientations are equally likely. Hence, symmetrically treating spatial and temporal information might be sub-optimal. Indeed, MAEs trained on videos do not outperform MAEs trained on ImageNet on video instance tracking benchmarks (Table 1).
To address these limitations, we present Siamese Masked Autoencoders (SiamMAE): a simple extension of MAEs for learning visual correspondence from videos. In our approach, two frames are randomly selected from a video clip, with the future frame having a significant portion (95%) of its patches randomly masked, while the past frame is left intact. These frames are processed independently by an encoder network, and a decoder composed of a sequence of cross-attention layers is tasked with predicting the missing patches in the future frame. Our asymmetric masking approach encourages the network to model object motion, or in other words, to understand what went where [32]. Simple extensions of MAEs to frames with symmetric masking waste model capacity on modeling low-level image details. In contrast, by providing the entire past frame as input, our network is primarily focused on propagating the patches from the past frame to their corresponding locations in the future frame. The cross-attention layers in our decoder serve a function akin to the affinity matrix often employed in self-supervised correspondence learning approaches. Empirically, we find that the combination of asymmetric masking, a siamese encoder, and our decoder can effectively learn features suitable for tasks requiring fine-grained and object-level correspondence.
Despite the conceptual simplicity of our method, it outperforms state-of-the-art self-supervised methods on video object segmentation, pose keypoint propagation, and semantic part propagation. Moreover, our ViT-S/16 models significantly outperform larger models trained on ImageNet (+8.5% J&F_m for ViT-B) and Kinetics-400 (+7.4% J&F_m for ViT-L) via MVM in video object segmentation tasks. We also observe significant performance gains across all tasks with models trained with smaller patch sizes (ViT-S/8, ViT-B/8). SiamMAE achieves competitive results without relying on data augmentation [16,17], handcrafted tracking-based pretext tasks [15,14], multi-crop training [17], or additional techniques to prevent representational collapse [10,11,17,16] or enhance performance [11]. We believe that our detailed analysis, straightforward approach, and state-of-the-art performance can serve as a robust baseline for self-supervised correspondence learning.

Section: Related Work
Temporal correspondence. The visual world is smooth and continuous [33,34], providing a rich source of information for biological and machine vision systems. In biological vision, infants learn about objects and their properties by establishing temporal correspondence, taking advantage of the inherent smoothness in the visual input [35]. Similarly, in machine vision, learning fine-grained correspondence from video frames is an important problem that has been studied for decades in the form of optical flow and motion estimation [36-42, 5, 6, 43, 44, 7]. However, despite their impressive performance, these methods rely on costly human-annotated or synthetic data with pixel-level ground truth annotations [45,46]. A more semantically meaningful task involves determining object-level correspondence, i.e., visual object tracking [47][48][49][50][51][52]. One popular approach is tracking-by-matching, which utilizes deep features learned via supervised [53,54] or self-supervised learning [10-12, 14-16] on videos. State-of-the-art methods [14][15][16][17] for self-supervised feature learning for correspondence primarily employ some form of contrastive learning [13]. Predictive learning has also been used for learning correspondences [10,11] by predicting the target colors of a grayscale input frame from a colorful reference frame. However, the performance of these methods has trailed behind contrastive approaches. In this work, we show that predictive learning based methods can be used for learning fine-grained and object-level correspondence.
[... 19,17,20] or only similarity [77,21]. Furthermore, contrastive learning has also been successfully applied to videos [78-83]. However, a limitation of contrastive approaches is their dependence on careful selection of augmentations to learn useful invariances [18], as well as the need for a suite of additional components [19,20,17,21] to prevent representational collapse.

Masked autoencoders. Masked autoencoders are a type of denoising autoencoder [84] that learn representations by reconstructing the original input from corrupted (i.e., masked) inputs. The introduction of masked language modeling in BERT [85] has had a transformative impact on the natural language processing field, particularly when scaled to large datasets and model sizes [86,87]. Masked autoencoders have also been successfully adapted to learn representations from images [24][25][26] and videos [28,27]. Our work studies a simple extension of MAEs [24] to videos. However, unlike prior methods [28,27] that symmetrically mask all frames, we propose an asymmetric masking scheme, leaving the past frame unchanged and masking a higher percentage of the future frame.
[... 19,17,20], as their design allows an easy way to learn invariant visual representations from data. Inspired by the success of masked autoencoders, researchers have also explored combining contrastive learning with siamese networks and masked visual modeling [91,92]. However, we are not aware of any previous studies that have investigated siamese masked autoencoders using asymmetric masking for representation learning from videos.

Section: Method
Our goal is to develop a self-supervised method for learning correspondence. To that end, we study a simple extension of MAE [24] to video data (Fig. 1). In this section, we describe the key components of our Siamese Masked Autoencoder.
Figure 2: Visualizations on the Kinetics-400 [93] validation set (masking ratio 90%). For each video sequence, we sample a clip of 8 frames with a frame gap of 4 and show the original video (top), SiamMAE output (middle), and masked future frames (bottom). Reconstructions are shown with f1 as the first frame of the video clip and f2 as the remaining frames, using a SiamMAE pre-trained ViT-S/8 encoder with a masking ratio of 95%.
Patchify. Given a video clip with L frames, we first randomly sample two frames, f1 and f2. The distance between these two frames is determined by selecting a random value from a predetermined range of potential frame gaps. Following the original ViT [31], we "patchify" each frame by converting it into a sequence of non-overlapping N × N patches. Finally, position embeddings [94] are added to the linear projections [31] of the patches, and a [CLS] token is appended. We do not use any temporal position embeddings.
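As a rough illustration of the frame sampling and patchify steps, the sketch below uses only the Python standard library. The function names, the single-channel nested-list frame layout, and the default gap range of 4 to 48 frames (the range used in the ablations) are assumptions made for this example, not the authors' implementation.

```python
import random

def sample_frame_pair(num_frames, min_gap=4, max_gap=48, rng=random):
    """Pick indices (i1, i2) of a past and a future frame with a random gap."""
    # Clamp the gap so the future frame stays inside the clip.
    max_gap = min(max_gap, num_frames - 1)
    if max_gap < min_gap:
        raise ValueError("clip is too short for the requested frame gap")
    gap = rng.randint(min_gap, max_gap)          # inclusive on both ends
    i1 = rng.randint(0, num_frames - 1 - gap)    # index of the past frame
    return i1, i1 + gap

def patchify(frame, patch):
    """Split an H x W frame (nested lists) into non-overlapping
    patch x patch blocks, each flattened, in row-major patch order."""
    height, width = len(frame), len(frame[0])
    assert height % patch == 0 and width % patch == 0
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            patches.append([frame[r][c]
                            for r in range(top, top + patch)
                            for c in range(left, left + patch)])
    return patches

frame = [[r * 4 + c for c in range(4)] for r in range(4)]  # toy 4 x 4 frame
print(patchify(frame, 2)[0])  # top-left patch: [0, 1, 4, 5]
```

In a real pipeline each flattened patch would then be linearly projected and summed with its position embedding; that step is omitted here.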
Masking. Natural signals like images and videos are highly redundant, exhibiting spatial and spatiotemporal redundancies, respectively [33,34]. To create a challenging predictive self-supervised learning task, MAEs randomly mask a high percentage (75%) of image patches [24] and extensions to videos [28,27] use an even higher masking ratio (90%). In both images and videos, the masking strategy is symmetric, i.e., all frames have a similar masking ratio. This deliberate design choice prevents the network from leveraging and learning temporal correspondence, leading to sub-optimal performance on correspondence learning benchmarks.
We posit that asymmetric masking can create a challenging self-supervised learning task while encouraging the network to learn temporal correlations. Specifically, we do not mask any patches in f1 (0%) and mask a very high ratio (95%) of patches in f2. By providing the entire past frame as input, the network only needs to propagate the patches from the past frame to their appropriate locations in the future frame. This, in turn, encourages the network to model object motion and focus on object boundaries (Fig. 5). To further increase the difficulty of the task, we sample the two frames with a large temporal gap. Although predicting further into the future is inherently ambiguous and may yield multiple plausible outcomes, providing a small number of patches as input for the second frame results in a challenging yet tractable self-supervised learning task.
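The asymmetric scheme can be sketched in a few lines. The helper names below and the choice of 196 patches per frame (for example, a 14 × 14 grid) are hypothetical; only the 0% and 95% masking ratios come from the text.

```python
import random

def random_keep_indices(num_patches, mask_ratio, rng=random):
    """Return the sorted indices of the patches that stay visible."""
    num_keep = int(num_patches * (1 - mask_ratio))
    return sorted(rng.sample(range(num_patches), num_keep))

def asymmetric_mask(num_patches, past_ratio=0.0, future_ratio=0.95, rng=random):
    """Visible patch indices for the past frame (f1) and the future frame (f2)."""
    keep_f1 = random_keep_indices(num_patches, past_ratio, rng)    # all patches
    keep_f2 = random_keep_indices(num_patches, future_ratio, rng)  # a small subset
    return keep_f1, keep_f2

keep_f1, keep_f2 = asymmetric_mask(196, rng=random.Random(0))
print(len(keep_f1), len(keep_f2))
```

With these assumed sizes, the encoder sees every patch of f1 but only a handful of patches of f2, so most of f2 has to be explained by content carried over from f1.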
Encoder. We explore two different encoder configurations for processing input frames.
A joint encoder is a natural extension of image MAEs to a pair of frames. The unmasked patches from the two frames are concatenated and then processed by a standard ViT encoder.
Siamese encoders [88] are weight-sharing neural networks used for comparing entities and are an essential component of modern contrastive representation learning methods [21]. Siamese networks have been used for correspondence learning [53,11,10] and often require some information bottleneck to prevent the network from learning trivial solutions. For example, Lai and Xie [11] propose to use color channel dropout to force the network to avoid relying on colors for matching correspondences. We use siamese encoders to process the two frames independently, and our asymmetric masking serves as an information bottleneck.
Decoder. The output from the encoder is projected using a linear layer and [MASK] tokens with position embeddings are added to generate the full set of tokens corresponding to the input frame. We explore three different decoder configurations which operate on the full set of tokens.
A joint decoder applies vanilla Transformer blocks on the concatenation of the full sets of tokens from both frames. A key downside of this approach is a substantial increase in GPU memory requirement, especially when using smaller patch sizes.
A cross-self decoder is similar to the original encoder-decoder design of the Transformer [94] model. Each decoder block consists of a cross-attention layer and a self-attention layer. The tokens from f2 attend to the tokens from f1 via the cross-attention layer and then attend to each other via the self-attention layer. We note that the cross-attention layer is functionally similar to the affinity matrix often used in self-supervised correspondence learning approaches [11,10].
A cross decoder consists of decoder blocks with only cross-attention layers, where tokens from f2 attend to the tokens from f1.
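To make the link between cross-attention and an affinity matrix concrete, here is a minimal single-head sketch in plain Python. It omits the learned query, key, and value projections, multiple heads, and residual connections, so it is a simplification of a real decoder block; the function names and toy tokens are assumptions for illustration.

```python
import math

def softmax(scores):
    peak = max(scores)                        # subtract the max for stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries_f2, keys_f1, values_f1):
    """Each f2 token attends over all f1 tokens (single head, no projections).

    Returns the attended outputs and the len(f2) x len(f1) weight matrix,
    whose rows act like an affinity matrix between the two frames."""
    scale = 1.0 / math.sqrt(len(keys_f1[0]))
    dim = len(values_f1[0])
    outputs, affinity = [], []
    for q in queries_f2:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys_f1]
        weights = softmax(scores)
        affinity.append(weights)
        outputs.append([sum(w * v[d] for w, v in zip(weights, values_f1))
                        for d in range(dim)])
    return outputs, affinity

f1_tokens = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
f2_tokens = [[0.0, 5.0], [5.0, 0.0]]
_, affinity = cross_attention(f2_tokens, f1_tokens, f1_tokens)
best = [row.index(max(row)) for row in affinity]
print(best)  # index of the best-matching f1 token for each f2 token: [1, 0]
```

Each row of `affinity` sums to one, so it can be read as a soft assignment from one f2 token to the f1 tokens.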
Finally, the output sequence of the decoder is used to predict the normalized pixel values [24] in the masked patches. An ℓ2 loss is applied between the prediction of the decoder and the ground truth.
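A minimal sketch of this objective follows, assuming per-patch normalization to zero mean and unit variance and a mean squared error averaged over the masked patches only. The `eps` constant, the use of population variance, and the function names are assumptions for illustration.

```python
def normalize_patch(patch, eps=1e-6):
    """Standardize one flattened patch to zero mean and unit variance."""
    mean = sum(patch) / len(patch)
    var = sum((p - mean) ** 2 for p in patch) / len(patch)
    return [(p - mean) / (var + eps) ** 0.5 for p in patch]

def masked_l2_loss(pred_patches, target_patches, masked_indices):
    """Mean squared error over the masked patches of the future frame."""
    if not masked_indices:
        raise ValueError("at least one masked patch is required")
    total, count = 0.0, 0
    for idx in masked_indices:
        target = normalize_patch(target_patches[idx])
        for p, t in zip(pred_patches[idx], target):
            total += (p - t) ** 2
            count += 1
    return total / count

targets = [[0.0, 2.0, 4.0, 6.0], [3.0, 5.0, 7.0, 9.0]]
perfect = [normalize_patch(t) for t in targets]
zeros = [[0.0] * 4 for _ in targets]
print(masked_l2_loss(perfect, targets, [0, 1]))  # exact prediction gives 0.0
print(masked_l2_loss(zeros, targets, [0]))       # close to 1 for an all-zero guess
```

Because targets are normalized per patch, an all-zero prediction scores roughly the unit variance of the target, which gives a convenient reference point when reading loss curves.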

Section: Experiments
In this section, we evaluate our method on three different tasks, compare its performance with prior state-of-the-art methods, and perform extensive ablation studies of different design choices. For qualitative results, see Fig. 2, Fig. 3, Fig. 5 and videos on our project website.

Section: Experimental Setup
Backbone. We use the ViT-S/16 model for most of our experiments as it is similar to ResNet-50 in terms of the number of parameters (21M vs 23M) and allows for fair comparisons across different self-supervised learning and correspondence learning methods.
Pre-training. Models are pre-trained using Kinetics-400 [93] [...] and horizontal flipping. Training is done for 400 epochs for the ablation studies (Tables 2 and 3) and for 2000 epochs for the results in Table 1. We adopt a repeated sampling factor [99,27] of 2 and report "effective epochs", i.e., the number of times a training video is viewed during training. We use the AdamW optimizer [100] with a batch size of 2048. Additional training details are in § A.
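As a quick sanity check on the "effective epochs" bookkeeping: if every loaded video is sampled twice (repeated sampling factor of 2), the number of passes over the video list is half the reported effective epochs. The helper below is a hypothetical illustration of that arithmetic, under this reading of the definition.

```python
def loader_passes(effective_epochs, repeat_factor):
    """Passes over the video list needed to reach the given number of
    effective epochs when each loaded video is sampled repeat_factor times."""
    if repeat_factor <= 0 or effective_epochs % repeat_factor != 0:
        raise ValueError("effective_epochs must be a positive multiple of repeat_factor")
    return effective_epochs // repeat_factor

schedule = {"ablations": 400, "main results": 2000}
passes = {name: loader_passes(epochs, 2) for name, epochs in schedule.items()}
print(passes)  # {'ablations': 200, 'main results': 1000}
```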
Evaluation methodology. We evaluate the quality of learned representations for dense correspondence tasks using k-nearest neighbor inference on three downstream tasks: video object segmentation (DAVIS-2017 [95]), human pose propagation (JHMDB [96]) and semantic part propagation (VIP [97]). Following prior work [14][15][16], all tasks are formulated as video label propagation: given the ground-truth label for the initial frame, the goal is to predict the label for each pixel in future frames of a video. We also provide temporal context during inference by maintaining a queue of the last m frames, and we limit the set of source patches considered to a spatial neighborhood of the query patch. See § A for evaluation hyperparameters.
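The label propagation protocol can be sketched as follows, assuming L2-normalized patch features, a softmax over the top-k similarities, and a square spatial neighborhood. The function name, the `temperature` parameter, and the toy one-row grid are assumptions for illustration; the actual hyperparameters are those listed in § A.

```python
import math
from collections import deque

def propagate_labels(query_feats, context, grid_w, k=2, radius=1, temperature=0.1):
    """Predict per-patch label distributions for a target frame.

    query_feats: one feature vector per patch (row-major, grid width grid_w).
    context: iterable of (features, labels) pairs for the source frames."""
    sources = [(i, feat, lab)
               for feats, labels in context
               for i, (feat, lab) in enumerate(zip(feats, labels))]
    num_classes = len(sources[0][2])
    predictions = []
    for q_idx, q in enumerate(query_feats):
        q_row, q_col = divmod(q_idx, grid_w)
        candidates = []
        for s_idx, feat, lab in sources:
            s_row, s_col = divmod(s_idx, grid_w)
            # Only source patches near the query location are considered.
            if abs(s_row - q_row) <= radius and abs(s_col - q_col) <= radius:
                sim = sum(a * b for a, b in zip(q, feat))
                candidates.append((sim, lab))
        top = sorted(candidates, key=lambda c: c[0], reverse=True)[:k]
        peak = top[0][0]
        weights = [math.exp((sim - peak) / temperature) for sim, _ in top]
        total = sum(weights)
        predictions.append([sum(w * lab[c] for w, (_, lab) in zip(weights, top)) / total
                            for c in range(num_classes)])
    return predictions

# Toy 1 x 3 grid: the object (class 0) moves from patch 0 to patch 1.
first_feats = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
first_labels = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
next_feats = [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]

queue = deque(maxlen=2)                        # holds the last m = 2 frames
queue.append((first_feats, first_labels))
pred = propagate_labels(next_feats, queue, grid_w=3)
queue.append((next_feats, pred))               # predictions become new context
classes = [row.index(max(row)) for row in pred]
print(classes)  # [1, 0, 1]
```

The `deque` with `maxlen` drops the oldest frame automatically, which mirrors the fixed-size temporal context described above.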

Section: Comparison with Prior Work
Video Object Segmentation. We first evaluate our model on DAVIS 2017 [95], a benchmark for video object segmentation, for the task of semi-supervised multi-object segmentation. We follow the evaluation protocol of prior work and use 480p images for evaluation. We find that SiamMAE significantly outperforms VideoMAE (39.3% to 62.0%), which we attribute to the tube masking scheme used in VideoMAE, which prevents the model from learning temporal correspondences. Like DINO [17], we also find that reducing the patch size leads to significant performance gains. Our ViT-S/8 (+9.4%) model outperforms all prior contrastive learning and self-supervised correspondence learning approaches. Finally, we note that although the larger MAE-ST models (ViT-L/16, 304M parameters) trained with random masking perform better than VideoMAE, their performance still lags SiamMAE by a considerable margin. Surprisingly, we find that MAEs trained on videos perform similarly to image MAEs. Unlike images, which are (approximately) isotropic [29], the temporal dimension is special [30], and not all spatio-temporal orientations are equally likely. Hence, symmetrically treating spatial and temporal information might be sub-optimal.
Video Part Segmentation. Next, we evaluate SiamMAE on the Video Instance Parsing (VIP) [97] benchmark, which involves propagating semantic masks for 20 different human parts. Compared to other datasets in our evaluation, VIP is especially challenging, as it involves much longer videos (up to 120 seconds). We follow the evaluation protocol of prior work [12], using 560 × 560 images and a single context frame. On this challenging task, our ViT-S/8 model substantially outperforms DINO (39.5 to 45.9). SiamMAE benefits more from smaller patch sizes than DINO, achieving an +8.6 [...]
Pose Tracking. We evaluate SiamMAE on the task of keypoint propagation, which involves propagating 15 keypoints and requires spatially precise correspondence. We follow the evaluation protocol of prior work [12], using 320 × 320 images and a single context frame. SiamMAE outperforms all prior work and benefits more from smaller patch sizes than DINO (+14.9 vs. +10.9 PCK@0.1).
Finally, we test the scalability of SiamMAE by training and evaluating ViT-B models. Across all three tasks, ViT-B models outperformed ViT-S models for both patch sizes tested.

Section: Ablation Studies
We ablate SiamMAE to understand the contribution of each design decision. The default settings are: siamese encoder, cross-self decoder, asymmetric masking ratio (95%), and a frame sampling gap of 4-48.
FrameMAE. We compare SiamMAE with FrameMAE (Table 2a), an extension of MAEs to video frames, i.e., joint encoder and joint decoder with symmetric masking ratio. FrameMAE performs significantly worse when the masking ratio is too high (90%) or too low (50%). With a 90% masking ratio, the task becomes challenging (higher loss) due to the insufficient number of patches available to learn temporal correspondence. When the masking ratio is 50%, the task becomes easier (lower loss) and the network can reconstruct the frames without relying on temporal information, due to the spatial redundancy of images. SiamMAE with an asymmetric masking ratio works best.
Encoder-decoder design. An important design decision of SiamMAE is the choice of encoder and decoder. We study the performance of various combinations of encoders and decoders with asymmetric masking in Table 2b. Joint encoders perform significantly worse compared to their siamese counterparts across all decoder designs. This can be attributed to the difference between the training and testing setups, as each frame is processed independently during the testing phase.
A siamese encoder with a cross decoder performs worst among the siamese encoder configurations. We also observe that the training loss is higher and the reconstructed frames are spatially incoherent, as all patches from f2 are processed independently. Finally, the combination of a siamese encoder with a cross-self decoder outperforms all other pairings. The cross-attention operation is similar to the affinity matrix used in self-supervised correspondence learning and is also used for label propagation in our evaluation protocol. Hence, by processing the frames independently and decoding them via the cross-self decoder, the network is encouraged to learn good representations for dense visual correspondence.
Masking. Next, we discuss the effect of the masking scheme for the combination of a siamese encoder with a cross-self decoder. Random symmetric masking performs poorly and is also worse than the corresponding FrameMAE configurations (Tables 2a, 2c). We also study the grid-wise mask sampling strategy, which keeps every alternate patch. This is an easier task, as the masking pattern enables the network to exploit and learn spatio-temporal correlations. Although we see significant gains (41.5 to 48.2), performance is still significantly worse than SiamMAE. In Table 2d, we study the role of different asymmetric masking ratios. We notice a clear trend: increasing the masking ratio from 50% to 95% increases the performance (49.0% to 58.1%).
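For reference, grid-wise sampling can be sketched as below. Reading "keeps every alternate patch" as a checkerboard pattern is an assumption; other regular patterns would fit the same description, and the function name is hypothetical.

```python
def grid_keep_indices(grid_h, grid_w):
    """Visible patch indices for a checkerboard pattern (row-major order):
    a patch is kept when its row and column indices have the same parity."""
    return [r * grid_w + c
            for r in range(grid_h)
            for c in range(grid_w)
            if (r + c) % 2 == 0]

keep = grid_keep_indices(4, 4)
print(keep)                 # [0, 2, 5, 7, 8, 10, 13, 15]
print(1 - len(keep) / 16)   # masking ratio of this pattern: 0.5
```

Under this reading, every masked patch has visible neighbors on all sides within the same frame, which is consistent with the observation that grid-wise masking yields an easier reconstruction task.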

Data augmentation. In Table 3a we study the influence of different data augmentation strategies. Similar to the findings in the image [24] and video [27] domains, we find that SiamMAE does not require extensive data augmentation to achieve competitive performance. Random cropping with a scale range of [0.5, 1] and horizontal flipping works best, and adding color jitter leads to performance degradation. Contrastive methods like DINO show impressive k-NN performance by using extensive data augmentation. In contrast, SiamMAE achieves superior results by relying on natural data augmentation available in videos, discussed next.
Frame sampling. Video data is a rich source of data augmentation, e.g., variations in pose, lighting, viewpoint, occlusions, etc. To effectively leverage this, we study the importance of frame sampling in Table 3b. The performance improves as we increase the frame sampling gap. Natural videos frequently exhibit gradual temporal changes; therefore, increasing the frame interval results in a more robust natural data augmentation, which in turn enhances performance. Our frame sampling strategy is simple and effective: randomly sample frames with a frame gap ranging from 4 to 48 frames.
Prediction target. In Table 5a (see § B) we study the importance of predicting the future. We consider two additional SiamMAE variations: one where we always predict the past frame (f1) and another where the order of frame prediction (f1 or f2) is randomized. All variations perform reasonably well, with our default setting (i.e., predicting the future) performing the best. We emphasize predicting future behavior due to its natural alignment with most real-world applications, which often necessitate the anticipation or prediction of agents' future behavior.

Section: Attention Map Analysis
In Fig. 5, we visualize the self-attention map of the ViT-S/8 model. We use the [CLS] token as the query and visualize the attention of a single head from the last layer with 720p images from ImageNet. We find that the model attends to the object boundaries. For instance, it can clearly delineate iconic objects (such as the sheep in the first row, first column), multiple objects (like the three baseball players in the third row, sixth column), and even when the scene is cluttered (as seen with the bird in the second row, fourth column). While other self-supervised learning approaches [17,101] have reported emergent object segmentation capabilities, we are unaware of any methods demonstrating an emergent ability to predict object boundaries. This emergent ability is unique and surprising since, unlike contrastive learning approaches, no loss function operates on the [CLS] token in SiamMAE (or in MAEs). We attribute the emergence of this ability to our asymmetric masking ratio, which encourages the model to learn about object boundaries from object motion in videos.

Section: Failure Analysis
We evaluate the quality of learned representations using label propagation and consequently inherit its limitations. Specifically, the inference algorithm lacks semantic understanding, leading to globally inconsistent labels (Fig. 6). This limitation can be overcome by fine-tuning the learned representations with task-specific architectural changes. Additionally, there are instances where the inference process might miss intricate object details, like the spokes of a tire. While this shortcoming can be mitigated by using a smaller patch size during training and inference, it comes at a higher compute cost.

Section: Conclusion
In this work, we introduce SiamMAE, a simple method for representation learning from videos. Our approach is based on the intuition that the temporal dimension should be treated differently from the spatial dimension. We demonstrate that an asymmetric masking strategy, i.e., masking a high percentage of patches of the future frame while keeping the past frame unchanged, is an effective strategy for learning correspondence. By predicting a majority fraction of the future frame, we find that our SiamMAE is able to learn the notion of object boundaries (Fig. 5). Moreover, unlike MAEs, features learned via our approach can be used in a zero-shot manner and outperform state-of-the-art self-supervised methods in various tasks, such as video object segmentation, pose keypoint propagation, and semantic part propagation. SiamMAE achieves these competitive results without the need for data augmentation, handcrafted tracking-based pretext tasks, or other techniques to prevent representational collapse. We hope our work will encourage further exploration of learning representations by predicting the future.
Future work. Our study focuses on learning correspondences by operating on pairs of video frames. This choice was driven by the empirical success of the approach and the limited computational resources available. Consequently, we believe that further investigation is needed to understand the role of predicting multiple future frames based on past frames, both for general visual representation learning and for correspondence learning specifically. An important future direction is to systematically examine the scalability of our approach in terms of both data and model size. Following previous work, we utilize internet videos for pre-training. However, it is essential to also investigate the impact of different types of video data, such as egocentric videos [102] versus "in-the-wild" internet videos. Lastly, our learned representations hold potential for applications involving embodied agents (i.e., robots), as the concept of correspondence could be useful in tasks such as object manipulation, navigation, and interaction within dynamic environments.

Section: Acknowledgments
We thank Abhishek Kadian for helpful discussions. This research was in part supported by the Stanford Institute for Human-Centered Artificial Intelligence (HAI) and ONR MURI N00014-22-1-2740.

Section: A Implementation Details
Training. Our training settings follow [24] and we build on the open-source implementation of MAEs (https://github.com/facebookresearch/mae) for all our experiments. We use the parameters specified in the original implementation unless specified otherwise in Table 4a. All our experiments are performed on 4 Nvidia Titan RTX GPUs for ViT-S/16 models, and on 8 Nvidia Titan RTX GPUs for ViT-S/8 models and ViT-B models.
Evaluation methodology. Our evaluation methodology follows prior work [14][15][16], and in Table 1 we report results previously reported in these studies. For recent self-supervised learning approaches like DINO, MAEs, MAE-ST and VideoMAE, we carry out a comprehensive grid search on the evaluation hyperparameters listed in Table 4b, and report the optimal results obtained. The evaluation parameters for SiamMAE can be found in Table 4b.

Section: B Additional Ablations
Prediction target. In Table 5a we study the importance of predicting the future. We consider two additional SiamMAE variations: one where we always predict the past frame (f1) and another where the order of frame prediction (f1 or f2) is randomized. All variations perform reasonably well, with our default setting (i.e., predicting the future) performing the best. We emphasize predicting future behavior due to its natural alignment with most real-world applications, which often necessitate the anticipation or prediction of agents' future behavior.
Frame overlap analysis. To perform frame overlap analysis, we sampled video frames from the Kinetics-400 validation set with the specified frame gap and calculated two image similarity metrics: mean squared error (MSE) and structural similarity index measure (SSIM). We observed that either a very high overlap (low frame gap, high SSIM, and low MSE) or a low overlap (high frame gap, low SSIM, and high MSE) adversely affects performance. Sampling with a frame gap of 16 or within a range of [4,48] yields the best results. Interestingly, the overlap metrics for a frame gap of 16 and [4,48] are comparable, suggesting that a particular degree of overlap is important for best results.
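The two overlap metrics can be illustrated on flattened grayscale frames. Note that the SSIM below is a simplified global, single-window variant using the standard constants C1 = (0.01 L)^2 and C2 = (0.03 L)^2; practical SSIM implementations average the same expression over local windows, so values will differ. The function names and the toy frames are assumptions for illustration.

```python
def mse(frame_a, frame_b):
    """Mean squared error between two flattened frames of equal length."""
    return sum((x - y) ** 2 for x, y in zip(frame_a, frame_b)) / len(frame_a)

def global_ssim(frame_a, frame_b, data_range=255.0):
    """SSIM computed once over the whole frame (no sliding window)."""
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    n = len(frame_a)
    mu_a = sum(frame_a) / n
    mu_b = sum(frame_b) / n
    var_a = sum((x - mu_a) ** 2 for x in frame_a) / n
    var_b = sum((y - mu_b) ** 2 for y in frame_b) / n
    cov = sum((x - mu_a) * (y - mu_b) for x, y in zip(frame_a, frame_b)) / n
    numerator = (2 * mu_a * mu_b + c1) * (2 * cov + c2)
    denominator = (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    return numerator / denominator

past = [0.0, 64.0, 128.0, 255.0]
future = [255.0, 128.0, 64.0, 0.0]      # toy "low overlap" frame
print(mse(past, past), global_ssim(past, past))
print(mse(past, future), global_ssim(past, future))
```

Identical frames give an MSE of zero and an SSIM of one; as the frame gap grows and content shifts, MSE rises and SSIM falls, which is the trend the overlap analysis relies on.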


References:
[b0] Federico Perazzi; Jordi Pont-Tuset; Brian Mcwilliams; Luc Van Gool; Markus Gross; Alexander Sorkine-Hornung (2016). A benchmark dataset and evaluation methodology for video object segmentation. 
[b1] Qiang Wang; Li Zhang; Luca Bertinetto; Weiming Hu; Philip Hs Torr (2019). Fast online object tracking and segmentation: A unifying approach. 
[b2] Ashutosh Saxena; Jamie Schulte; Andrew Y Ng (2007). Depth estimation using monocular and stereo cues. 
[b3] Haofei Xu; Jing Zhang; Jianfei Cai; Hamid Rezatofighi; Fisher Yu; Dacheng Tao; Andreas Geiger (2022). Unifying flow, stereo and depth estimation. 
[b4] Alexey Dosovitskiy; Philipp Fischer; Eddy Ilg; Philip Hausser; Caner Hazirbas; Vladimir Golkov; Patrick Van Der; Daniel Smagt; Thomas Cremers;  Brox (2015). Flownet: Learning optical flow with convolutional networks. 
[b5] Eddy Ilg; Nikolaus Mayer; Tonmoy Saikia; Margret Keuper; Alexey Dosovitskiy; Thomas Brox (2017). Flownet 2.0: Evolution of optical flow estimation with deep networks. 
[b6] Zachary Teed; Jia Deng (2020). Raft: Recurrent all-pairs field transforms for optical flow. Springer
[b7] Richard Hartley; Andrew Zisserman (2003). Multiple view geometry in computer vision. Cambridge university press
[b8]  (2023-05). Self-supervised learning: The dark matter of intelligence. 
[b9] Carl Vondrick; Abhinav Shrivastava; Alireza Fathi; Sergio Guadarrama; Kevin Murphy (2018). Tracking emerges by colorizing videos. 
[b10] Zihang Lai; Weidi Xie (2019). Self-supervised learning for video correspondence flow. 
[b11] Xueting Li; Sifei Liu; Shalini De Mello; Xiaolong Wang; Jan Kautz; Ming-Hsuan Yang (2019). Joint-task self-supervised learning for temporal correspondence. Advances in Neural Information Processing Systems
[b12] Raia Hadsell; Sumit Chopra; Yann Lecun (2006). Dimensionality reduction by learning an invariant mapping. IEEE
[b13] Xiaolong Wang; Allan Jabri; Alexei A Efros (2019). Learning correspondence from the cycle-consistency of time. 
[b14] Allan Jabri; Andrew Owens; Alexei Efros (2020). Space-time correspondence as a contrastive random walk. Advances in neural information processing systems
[b15] Jiarui Xu; Xiaolong Wang (2021). Rethinking self-supervised correspondence learning: A video frame-level similarity perspective. 
[b16] Mathilde Caron; Hugo Touvron; Ishan Misra; Hervé Jégou; Julien Mairal; Piotr Bojanowski; Armand Joulin (2021). Emerging properties in self-supervised vision transformers. 
[b17] Tete Xiao; Xiaolong Wang; Alexei A Efros; Trevor Darrell (2020). What should not be contrastive in contrastive learning. 
[b18] Kaiming He; Haoqi Fan; Yuxin Wu; Saining Xie; Ross Girshick (2020). Momentum contrast for unsupervised visual representation learning. 
[b19] Ting Chen; Simon Kornblith; Mohammad Norouzi; Geoffrey Hinton (2020). A simple framework for contrastive learning of visual representations. PMLR
[b20] Xinlei Chen; Kaiming He (2021). Exploring simple siamese representation learning. 
[b21] Jacob Devlin; Ming-Wei Chang; Kenton Lee; Kristina Toutanova (2019). Bert: Pre-training of deep bidirectional transformers for language understanding. 
[b22] Tom Brown; Benjamin Mann; Nick Ryder; Melanie Subbiah; Jared D Kaplan; Prafulla Dhariwal; Arvind Neelakantan; Pranav Shyam; Girish Sastry; Amanda Askell; Sandhini Agarwal; Ariel Herbert-Voss; Gretchen Krueger; Tom Henighan; Rewon Child; Aditya Ramesh; Daniel Ziegler; Jeffrey Wu; Clemens Winter; Chris Hesse; Mark Chen; Eric Sigler; Mateusz Litwin; Scott Gray; Benjamin Chess; Jack Clark; Christopher Berner; Sam Mccandlish; Alec Radford; Ilya Sutskever; Dario Amodei (2020). Language models are few-shot learners. 
[b23] Kaiming He; Xinlei Chen; Saining Xie; Yanghao Li; Piotr Dollár; Ross Girshick (2022). Masked autoencoders are scalable vision learners. 
[b24] Hangbo Bao; Li Dong; Songhao Piao; Furu Wei (2022). BEit: BERT pre-training of image transformers. 
[b25] Zhenda Xie; Zheng Zhang; Yue Cao; Yutong Lin; Jianmin Bao; Zhuliang Yao; Qi Dai; Han Hu (2022). Simmim: A simple framework for masked image modeling. 
[b26] Christoph Feichtenhofer; Haoqi Fan; Yanghao Li; Kaiming He (2022). Masked autoencoders as spatiotemporal learners. 
[b27] Zhan Tong; Yibing Song; Jue Wang; Limin Wang (2022). Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training. 
[b28]  Daniel L Ruderman (1994). The statistics of natural images. Network: computation in neural systems
[b29] Edward H Adelson; James R Bergen (1985). Spatiotemporal energy models for the perception of motion. JOSA A
[b30] Alexey Dosovitskiy; Lucas Beyer; Alexander Kolesnikov; Dirk Weissenborn; Xiaohua Zhai; Thomas Unterthiner; Mostafa Dehghani; Matthias Minderer; Georg Heigold; Sylvain Gelly (2020). An image is worth 16x16 words: Transformers for image recognition at scale. 
[b31] Josh Wills; Sameer Agarwal; Serge J Belongie (2003). What went where. 
[b32] Fred Attneave (1954). Some informational aspects of visual perception. Psychological review
[b33] Eero P Simoncelli; Bruno A Olshausen (2001). Natural image statistics and neural representation. Annual review of neuroscience
[b34] Elizabeth S Spelke; Peter Vishton; Claes Von Hofsten (1995). Object perception, object-directed action, and physical knowledge in infancy. 
[b35] James J Gibson (1950). The perception of the visual world. 
[b36] Berthold K P Horn; Brian G Schunck (1981). Determining optical flow. Artificial intelligence
[b37] Bruce D Lucas; Takeo Kanade (1981). An iterative image registration technique with an application to stereo vision. 
[b38] Thomas Brox; Andrés Bruhn; Nils Papenberg; Joachim Weickert (2004). High accuracy optical flow estimation based on a theory for warping. Springer
[b39] Deqing Sun; Stefan Roth; Michael J Black (2010). Secrets of optical flow estimation and their principles. IEEE
[b40] Ce Liu; Jenny Yuen; Antonio Torralba (2010). Sift flow: Dense correspondence across scenes and its applications. IEEE transactions on pattern analysis and machine intelligence
[b41] Moritz Menze; Christian Heipke; Andreas Geiger (2015). Discrete optimization for optical flow. Springer
[b42] Jia Xu; René Ranftl; Vladlen Koltun (2017). Accurate optical flow via direct cost volume processing. 
[b43] Deqing Sun; Xiaodong Yang; Ming-Yu Liu; Jan Kautz (2018). Pwc-net: Cnns for optical flow using pyramid, warping, and cost volume. 
[b44] D J Butler; J Wulff; G B Stanley; M J Black (2012-10). A naturalistic open source movie for optical flow evaluation. Springer-Verlag
[b45] Andreas Geiger; Philip Lenz; Christoph Stiller; Raquel Urtasun (2013). Vision meets robotics: The kitti dataset. The International Journal of Robotics Research
[b46] Ishwar K Sethi; Ramesh Jain (1987). Finding trajectories of feature points in a monocular image sequence. IEEE Transactions on pattern analysis and machine intelligence
[b47] Dorin Comaniciu; Ramesh Visvanathan; Peter Meer (2003). Kernel-based object tracking. IEEE Transactions on pattern analysis and machine intelligence
[b48] Changjiang Yang; Ramani Duraiswami; Larry Davis (2005). Efficient mean-shift tracking via a new similarity measure. IEEE
[b49] Mykhaylo Andriluka; Stefan Roth; Bernt Schiele (2008). People-tracking-by-detection and people-detection-by-tracking. IEEE
[b50] Zdenek Kalal; Krystian Mikolajczyk; Jiri Matas (2011). Tracking-learning-detection. IEEE transactions on pattern analysis and machine intelligence
[b51] Philipp Bergmann; Tim Meinhardt; Laura Leal-Taixe (2019). Tracking without bells and whistles. 
[b52] Luca Bertinetto; Jack Valmadre; Joao F Henriques; Andrea Vedaldi; Philip Hs Torr (2016). Fully-convolutional siamese networks for object tracking. Springer
[b53] Jack Valmadre; Luca Bertinetto; Joao Henriques; Andrea Vedaldi; Philip Hs Torr (2017). End-to-end representation learning for correlation filter based tracking. 
[b54] Carl Doersch; Abhinav Gupta; Alexei A Efros (2015). Unsupervised visual representation learning by context prediction. 
[b55] Mehdi Noroozi; Paolo Favaro (2016). Unsupervised learning of visual representations by solving jigsaw puzzles. Springer
[b56] Richard Zhang; Phillip Isola; Alexei A Efros (2016). Colorful image colorization. Springer
[b57] Deepak Pathak; Ross Girshick; Piotr Dollár; Trevor Darrell; Bharath Hariharan (2017). Learning features by watching objects move. 
[b58] Spyros Gidaris; Praveer Singh; Nikos Komodakis (2018). Unsupervised representation learning by predicting image rotations. 
[b59] Nitish Srivastava; Elman Mansimov; Ruslan Salakhudinov (2015). Unsupervised learning of video representations using lstms. PMLR
[b60] Jacob Walker; Carl Doersch; Abhinav Gupta; Martial Hebert (2016). An uncertain future: Forecasting from static images using variational autoencoders. Springer
[b61] Carl Vondrick; Hamed Pirsiavash; Antonio Torralba (2016). Anticipating visual representations from unlabeled video. 
[b62] Michael Mathieu; Camille Couprie; Yann Lecun (2015). Deep multi-scale video prediction beyond mean square error. 
[b63] William Lotter; Gabriel Kreiman; David Cox (). Deep predictive coding networks for video prediction and unsupervised learning. 
[b64] Agrim Gupta; Stephen Tian; Yunzhi Zhang; Jiajun Wu; Roberto Martín-Martín; Li Fei-Fei (2023). Maskvit: Masked visual pre-training for video prediction. 
[b65] Ishan Misra; Lawrence Zitnick; Martial Hebert (2016). Shuffle and learn: unsupervised learning using temporal order verification. Springer
[b66] Basura Fernando; Hakan Bilen; Efstratios Gavves; Stephen Gould (2017). Self-supervised video representation learning with odd-one-out networks. 
[b67] Hsin-Ying Lee; Jia-Bin Huang; Maneesh Singh; Ming-Hsuan Yang (2017). Unsupervised representation learning by sorting sequences. 
[b68] Donglai Wei; Joseph J Lim; Andrew Zisserman; William T Freeman (2018). Learning and using the arrow of time. 
[b69] Dejing Xu; Jun Xiao; Zhou Zhao; Jian Shao; Di Xie; Yueting Zhuang (2019). Self-supervised spatiotemporal learning via video clip order prediction. 
[b70] Pulkit Agrawal; Joao Carreira; Jitendra Malik (2015). Learning to see by moving. 
[b71] Xiaolong Wang; Abhinav Gupta (2015). Unsupervised learning of visual representations using videos. 
[b72] Laurenz Wiskott; Terrence J Sejnowski (2002). Slow feature analysis: Unsupervised learning of invariances. Neural computation
[b73] Ross Goroshin; Joan Bruna; Jonathan Tompson; David Eigen; Yann Lecun (2015). Unsupervised learning of spatiotemporally coherent metrics. 
[b74] Suzanna Becker; Geoffrey E Hinton (1992). Self-organizing neural network that discovers surfaces in random-dot stereograms. Nature
[b75] Zhirong Wu; Yuanjun Xiong; Stella X Yu; Dahua Lin (2018). Unsupervised feature learning via non-parametric instance discrimination. 
[b76] Jean-Bastien Grill; Florian Strub; Florent Altché; Corentin Tallec; Pierre Richemond; Elena Buchatskaya; Carl Doersch; Bernardo Avila Pires; Zhaohan Guo; Mohammad Gheshlaghi Azar (2020). Bootstrap your own latent-a new approach to self-supervised learning. Advances in neural information processing systems
[b77] Pierre Sermanet; Corey Lynch; Yevgen Chebotar; Jasmine Hsu; Eric Jang; Stefan Schaal; Sergey Levine (2018). Time-contrastive networks: Self-supervised learning from video. IEEE
[b78] Chen Sun; Fabien Baradel; Kevin Murphy; Cordelia Schmid (2019). Learning video representations using contrastive bidirectional transformer. 
[b79] Tengda Han; Weidi Xie; Andrew Zisserman (2019). Video representation learning by dense predictive coding. 
[b80] Christoph Feichtenhofer; Haoqi Fan; Bo Xiong; Ross Girshick; Kaiming He (2021). A large-scale study on unsupervised spatiotemporal representation learning. 
[b81] Adria Recasens; Pauline Luc; Jean-Baptiste Alayrac; Luyu Wang; Florian Strub; Corentin Tallec; Mateusz Malinowski; Viorica Pătrăucean; Florent Altché; Michal Valko (2021). Broaden your views for self-supervised video learning. 
[b82] Rui Qian; Tianjian Meng; Boqing Gong; Ming-Hsuan Yang; Huisheng Wang; Serge Belongie; Yin Cui (2021). Spatiotemporal contrastive video representation learning. 
[b83] Pascal Vincent; Hugo Larochelle; Yoshua Bengio; Pierre-Antoine Manzagol (2008). Extracting and composing robust features with denoising autoencoders. 
[b84] Jacob Devlin; Ming-Wei Chang; Kenton Lee; Kristina Toutanova (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. 
[b85] Tom Brown; Benjamin Mann; Nick Ryder; Melanie Subbiah; Jared D Kaplan; Prafulla Dhariwal; Arvind Neelakantan; Pranav Shyam; Girish Sastry; Amanda Askell (2020). Language models are few-shot learners. 
[b86] Alec Radford; Jeffrey Wu; Rewon Child; David Luan; Dario Amodei; Ilya Sutskever (2019). Language models are unsupervised multitask learners. OpenAI blog
[b87] Jane Bromley; Isabelle Guyon; Yann Lecun; Eduard Säckinger; Roopak Shah (1993). Signature verification using a" siamese" time delay neural network. Advances in neural information processing systems
[b88] Yaniv Taigman; Ming Yang; Marc'aurelio Ranzato; Lior Wolf (2014). Deepface: Closing the gap to human-level performance in face verification. 
[b89] Gregory Koch; Richard Zemel; Ruslan Salakhutdinov (2015). Siamese neural networks for one-shot image recognition. 
[b90] Jinghao Zhou; Chen Wei; Huiyu Wang; Wei Shen; Cihang Xie; Alan Yuille; Tao Kong (2021). ibot: Image bert pre-training with online tokenizer. 
[b91] Mahmoud Assran; Mathilde Caron; Ishan Misra; Piotr Bojanowski; Florian Bordes; Pascal Vincent; Armand Joulin; Mike Rabbat; Nicolas Ballas (2022). Masked siamese networks for label-efficient learning. Springer
[b92] Will Kay; Joao Carreira; Karen Simonyan; Brian Zhang; Chloe Hillier; Sudheendra Vijayanarasimhan; Fabio Viola; Tim Green; Trevor Back; Paul Natsev (2017). The kinetics human action video dataset. 
[b93] Ashish Vaswani; Noam Shazeer; Niki Parmar; Jakob Uszkoreit; Llion Jones; Aidan N Gomez; Łukasz Kaiser; Illia Polosukhin (2017). Attention is all you need. Advances in neural information processing systems
[b94] Jordi Pont-Tuset; Federico Perazzi; Sergi Caelles; Pablo Arbeláez; Alex Sorkine-Hornung; Luc Van Gool (2017). The 2017 davis challenge on video object segmentation. 
[b95] Hueihan Jhuang; Juergen Gall; Silvia Zuffi; Cordelia Schmid; Michael J Black (2013). Towards understanding action recognition. 
[b96] Qixian Zhou; Xiaodan Liang; Ke Gong; Liang Lin (2018). Adaptive temporal encoding network for video instance-level human parsing. 
[b97] Kaiming He; Xiangyu Zhang; Shaoqing Ren; Jian Sun (2016). Deep residual learning for image recognition. 
[b98] Elad Hoffer; Tal Ben-Nun; Itay Hubara; Niv Giladi; Torsten Hoefler; Daniel Soudry (2020). Augment your batch: Improving generalization through instance repetition. 
[b99] Ilya Loshchilov; Frank Hutter (2019). Decoupled weight decay regularization. 
[b100] Jinghao Zhou; Chen Wei; Huiyu Wang; Wei Shen; Cihang Xie; Alan Yuille; Tao Kong (2022). Image BERT pre-training with online tokenizer. 
[b101] Kristen Grauman; Andrew Westbury; Eugene Byrne; Zachary Chavis; Antonino Furnari; Rohit Girdhar; Jackson Hamburger; Hao Jiang; Miao Liu; Xingyu Liu (2021). Ego4d: Around the world in 3,000 hours of egocentric video. 
[b102] Mark Chen; Alec Radford; Rewon Child; Jeffrey Wu; Heewoo Jun; David Luan; Ilya Sutskever (2020). Generative pretraining from pixels. PMLR
[b103] Ilya Loshchilov; Frank Hutter (2016). Sgdr: Stochastic gradient descent with warm restarts. 
[b104] Priya Goyal; Piotr Dollár; Ross Girshick; Pieter Noordhuis; Lukasz Wesolowski; Aapo Kyrola; Andrew Tulloch; Yangqing Jia; Kaiming He (2017). Accurate, large minibatch sgd: Training imagenet in 1 hour. 

Figures:
Figure fig_0: 1
Type: figure
Caption: Figure 1: Siamese Masked Autoencoders. During pre-training we randomly sample a pair of video frames and randomly mask a huge fraction (95%) of patches of the future frame while leaving the past frame unchanged. The two frames are processed independently by a siamese encoder parametrized by a ViT [31]. The decoder consists of a sequence of cross-attention layers and predicts missing patches in the future frame. Videos available at this project page.
Data: 

Figure fig_1: 3
Type: figure
Caption: Figure 3: Qualitative results on three downstream tasks: video object segmentation (DAVIS-2017 [95]), human pose propagation (JHMDB [96]), and semantic part propagation (VIP [97]).
Data: 

Figure fig_2: 4
Type: figure
Caption: Figure 4: Training schedule and patch size. Evaluation of SiamMAE performance on three downstream tasks for ViT-S/16 and ViT-S/8 models. Across all tasks, longer training and smaller patch sizes lead to improved performance.
Data: 

Figure fig_3: 5
Type: figure
Caption: Figure 5: Self-attention maps. Self-attention maps from a ViT-S/8 model. We examine the self-attention of the [CLS] token on the heads of the final layer. Unlike contrastive methods, there is no explicit loss function acting on the [CLS] token. These self-attention maps suggest that the model has learned the notion of object boundaries from object motion in videos. See project page for videos.
Data: 

Figure fig_4: 6
Type: figure
Caption: Figure 6: Failure analysis. A key disadvantage of using label propagation is the lack of global semantic understanding of objects. Assigning labels based solely on low-level features can lead to globally inconsistent labels, as illustrated by the following examples: (a) a segmentation mask that covers both hands; (b) a pose key-point determined using the person's hair, rather than their posture; (c) challenges in assigning labels to parts of the object that are occluded in the reference frame; and (d) the inability to assign labels to fine object details, such as the spokes of a tire.
Data: 

Figure tab_3: 1
Type: table
Caption: Comparison with prior work on three downstream tasks: video object segmentation (DAVIS-2017[95]
Data: DAVIS, VIP, JHMDB

Figure tab_5: 2
Type: table
Caption: SiamMAE ablation experiments on DAVIS[95] with the default setting: siamese encoder,
Data: 

Figure tab_7: 3
Type: table
Caption: Data augmentation. We ablate the importance of manual (spatial and color jitter) and natural data augmentation (frame sampling) for learning correspondence via our SiamMAE on DAVIS [95]. The table format follows Table 2.
Data: [Plot residue. Panels: Object Propagation (y-axis: J&F-Mean), Semantic Part Propagation (y-axis: mIoU), Pose Propagation (y-axis: PCK@0.1); x-axis: Epochs (log-scale); series: ViT-S/16, ViT-S/8.]



