Abstract: This paper presents a new self-supervised video representation learning framework, \textbf{ARVideo}, which \textit{autoregressively} predicts the next video token in a tailored sequence order.
Two key designs are introduced. First, we organize autoregressive video tokens into
clusters that span both the \textit{spatial} and \textit{temporal} dimensions, thereby enabling a richer aggregation of contextual information than standard spatial-only or temporal-only clusters. Second, we adopt a randomized spatiotemporal prediction order to facilitate learning from multi-dimensional data, addressing the limitations of handcrafted spatial-first or temporal-first sequence orders. Extensive experiments establish ARVideo as an effective paradigm for self-supervised video representation learning. For example,
when trained with the ViT-B backbone, ARVideo competitively attains 81.2\% on Kinetics-400 and 70.9\% on Something-Something V2, which are on par with the strong benchmark set by VideoMAE. Importantly, ARVideo also demonstrates higher training efficiency, \ie, it trains 14\% faster and requires 58\% less GPU memory compared to VideoMAE.
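Below is a minimal, hypothetical sketch (not the authors' released code) of the two design choices the abstract describes: grouping patch tokens into clusters that span both space and time, and sampling a randomized cluster-level prediction order. All function names, grid sizes, and cluster sizes are illustrative assumptions.

```python
import torch

def spatiotemporal_clusters(tokens, t_grid=8, h_grid=14, w_grid=14,
                            ct=2, ch=2, cw=2):
    """Reshape a (B, T*H*W, C) token sequence into clusters spanning time and space.

    Each cluster covers a ct x ch x cw block of the token grid, so it aggregates
    context along both the spatial and temporal axes (assumed grid/cluster sizes).
    """
    B, N, C = tokens.shape
    assert N == t_grid * h_grid * w_grid
    x = tokens.view(B, t_grid // ct, ct, h_grid // ch, ch, w_grid // cw, cw, C)
    # Bring the block indices (time-block, height-block, width-block) to the front,
    # keeping the ct*ch*cw tokens of each cluster contiguous.
    x = x.permute(0, 1, 3, 5, 2, 4, 6, 7).contiguous()
    return x.view(B, -1, ct * ch * cw, C)   # (B, num_clusters, cluster_size, C)

def randomized_prediction_order(num_clusters, generator=None):
    """Sample a random permutation over clusters for autoregressive prediction."""
    return torch.randperm(num_clusters, generator=generator)

# Usage: cluster the tokens, then order them randomly for next-cluster prediction.
tokens = torch.randn(2, 8 * 14 * 14, 768)
clusters = spatiotemporal_clusters(tokens)            # (2, 196, 8, 768) with 2x2x2 clusters
order = randomized_prediction_order(clusters.size(1))
shuffled = clusters[:, order]                          # sequence fed to the autoregressive model
```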
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Ming-Hsuan_Yang1
Submission Number: 3700