ETC: Towards Training-Efficient Video Synthesis by Exploiting Temporal Capabilities of Spatial Attention
Keywords: Efficient Video Generation, Video Diffusion Model
Abstract: Recently, synthesizing video from text, i.e., Text-to-Video (T2V), has demonstrated remarkable progress by transferring pre-trained Text-to-Image (T2I) diffusion models to the video domain, the core of which is to add new temporal layers for capturing temporal information. However, these additional layers inevitably incur extra computational overhead, as they must be trained from scratch on large-scale video datasets. Instead of retraining these costly layers, we ask whether temporal information can be learned by the original T2I model with only Spatial Attention. To this end, our theoretical and experimental explorations reveal that Spatial Attention has strong potential for temporal modeling and greatly promotes training efficiency. Inspired by this, we propose ETC, a new T2V framework that achieves high fidelity and high efficiency in both training and inference. Specifically, to adapt video to the spatial attention of the T2I model, we first design a novel temporal-to-spatial transfer strategy that organizes all video frames into a spatial grid. Then, we devise a simple yet effective Spatial-Temporal Mixed Embedding to distinguish inter-frame and intra-frame features. Benefiting from this strategy, which reduces the model's dependence on text-video paired data, we further present a data-efficient Triple-Data fusion strategy (caption-image, label-image, and caption-video pairs) that achieves better performance while training on only a small amount of video data. Extensive experiments show the superiority of our method over four strong SOTA methods in terms of quality and efficiency, improving FVD by 49% on average while using only 1% of the training data.
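The abstract does not specify implementation details, but the two core ideas (temporal-to-spatial transfer and the Spatial-Temporal Mixed Embedding) can be sketched roughly as below. This is a minimal PyTorch-style illustration under assumed tensor shapes and hypothetical function names (temporal_to_spatial_grid, add_mixed_embedding); it is not the authors' implementation.

```python
import torch

def temporal_to_spatial_grid(video: torch.Tensor, grid_rows: int) -> torch.Tensor:
    """Tile a video's frames into one large spatial grid so the T2I model's
    2D spatial attention can attend across frames.

    video: (B, T, C, H, W) -> (B, C, grid_rows * H, (T // grid_rows) * W)
    Assumes T is divisible by grid_rows (an assumption for this sketch).
    """
    b, t, c, h, w = video.shape
    grid_cols = t // grid_rows
    assert grid_rows * grid_cols == t, "T must factor into grid_rows * grid_cols"
    # (B, rows, cols, C, H, W) -> (B, C, rows, H, cols, W) -> (B, C, rows*H, cols*W)
    x = video.view(b, grid_rows, grid_cols, c, h, w)
    x = x.permute(0, 3, 1, 4, 2, 5).contiguous()
    return x.view(b, c, grid_rows * h, grid_cols * w)


def add_mixed_embedding(grid_tokens: torch.Tensor,
                        frame_ids: torch.Tensor,
                        frame_embed: torch.nn.Embedding,
                        pos_embed: torch.Tensor) -> torch.Tensor:
    """Add a per-frame (inter-frame) embedding on top of the usual spatial
    (intra-frame) positional embedding, so attention can tell apart tokens
    that belong to different frames within the grid.

    grid_tokens: (B, N, D) tokens flattened from the spatial grid
    frame_ids:   (N,) source-frame index of each token
    pos_embed:   (N, D) spatial positional embedding
    """
    return grid_tokens + pos_embed.unsqueeze(0) + frame_embed(frame_ids).unsqueeze(0)
```

A usage note: with, say, 16 frames and grid_rows=4, the grid is a 4x4 mosaic of frames, so the unchanged spatial attention of the T2I backbone already attends across frames, which is consistent with the abstract's claim that no new temporal layers need to be trained.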
Supplementary Material: zip
Primary Area: generative models
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 193