Abstract: Highlights
• We decouple temporal and channel features to reduce the computational cost of video processing.
• We adopt transposed attention, which attends over channels, to reduce computational cost.
• We leverage a global query strategy to capture global information.
• We propose a depth shift module to better integrate cross-channel or temporal information.
• Our video transformer achieves good video prediction quality efficiently.
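To illustrate why attending over channels can lower cost, here is a minimal NumPy sketch of generic transposed (channel-wise) attention. It is not the paper's implementation, whose details are not given in these highlights. The function name `transposed_attention`, the projection matrices `w_q`, `w_k`, `w_v`, the L2 normalization step, and the tensor sizes are all assumptions made for illustration. The point shown is that the attention map has shape (C, C) rather than (N, N), so its size does not grow with the number of tokens N.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def transposed_attention(x, w_q, w_k, w_v):
    """Channel-wise attention sketch (hypothetical helper).

    x: (N, C) array of N tokens with C channels.
    Returns the output (N, C) and the attention map (C, C).
    """
    q = x @ w_q  # (N, C)
    k = x @ w_k  # (N, C)
    v = x @ w_v  # (N, C)
    # Assumption: normalize each channel over tokens to keep logits bounded.
    q = q / (np.linalg.norm(q, axis=0, keepdims=True) + 1e-6)
    k = k / (np.linalg.norm(k, axis=0, keepdims=True) + 1e-6)
    # Similarity between channels, not between tokens: (C, N) @ (N, C).
    attn = softmax(q.T @ k, axis=-1)  # (C, C)
    # Each output channel is a weighted mix of the value channels.
    out = v @ attn.T  # (N, C)
    return out, attn


# Illustrative sizes only: many tokens, few channels.
rng = np.random.default_rng(0)
N, C = 4096, 32
x = rng.standard_normal((N, C))
w_q, w_k, w_v = (rng.standard_normal((C, C)) / np.sqrt(C) for _ in range(3))
out, attn = transposed_attention(x, w_q, w_k, w_v)
```

With these shapes, forming the map costs on the order of N·C² operations, compared with N²·C for token-wise attention, which is the saving when N is much larger than C.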