Abstract: Highlights
• A new efficient Transformer block for video feature learning is proposed by combining spatial local and temporal attention.
• A new family of video prediction Transformers is proposed, which reaches or outperforms complex SOTA ConvLSTM-based models.
• It is the first paper that conducts a formal comparison of three different attention-based video prediction variants.
External IDs: dblp:journals/ivc/YeB23