Scalable video transformer for full-frame video prediction

Zhan Li, Feng Liu

Published: 2024, Last Modified: 14 May 2025. Comput. Vis. Image Underst. 2024. CC BY-SA 4.0

Abstract: Highlights
• We decouple temporal and channel features to reduce video computation cost.
• We adopt transposed attention to focus on channels, reducing computation cost.
• We leverage a global query strategy to capture global information.
• We propose a depth shift module to better integrate cross-channel or temporal information.
• Our video transformer achieves good quality for video prediction efficiently.
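The highlights do not spell out how transposed attention is computed, so the following is only a minimal sketch of the general idea as it is commonly described: the attention map is built over channels (C x C) rather than over spatial tokens (N x N), so its size no longer grows quadratically with frame resolution. The function names, the projection matrices `w_q`, `w_k`, `w_v`, and the L2 normalization step are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def transposed_attention(x, w_q, w_k, w_v):
    """Channel-wise ("transposed") attention sketch.

    x: (N, C) array of N flattened spatial tokens with C channels.
    w_q, w_k, w_v: (C, C) projection matrices (hypothetical parameters).
    Returns the (N, C) output and the (C, C) attention map.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v            # each (N, C)
    # Assumption: L2-normalize each channel over the token axis
    # so the C x C similarity scores stay bounded.
    q = q / (np.linalg.norm(q, axis=0, keepdims=True) + 1e-6)
    k = k / (np.linalg.norm(k, axis=0, keepdims=True) + 1e-6)
    attn = softmax(q.T @ k, axis=-1)               # (C, C), not (N, N)
    out = v @ attn.T                               # (N, C)
    return out, attn

# Usage: a 32x32 frame flattened to 1024 tokens with 8 channels.
rng = np.random.default_rng(0)
tokens = rng.standard_normal((1024, 8))
w_q, w_k, w_v = (rng.standard_normal((8, 8)) for _ in range(3))
out, attn = transposed_attention(tokens, w_q, w_k, w_v)
```

In this sketch the attention map has 8 x 8 entries regardless of frame size, whereas token-wise attention over the same input would need a 1024 x 1024 map.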