We present sample results of our method.
We present sample results of our method on top of SDEdit ([7]).
| "a shiny silver robot" | Per-frame SDEdit [6] | TokenFlow + SDEdit [6] |
|---|---|---|
| "an ice sculpture" | ||
We present sample results of our method on top of ControlNet image synthesis ([9]).
| "a colorful oil painting of a wolf" | Per-frame ControlNet [6] | TokenFlow + ControlNet [6] |
|---|---|---|
| "an anime of a man in a field" | ||
We present sample results of our method combined with post-process de-flickering.
Existing text-guided video editing methods suffer from temporal inconsistency. Below, we compare our results with these methods.
| "rainbow textured dog" | Ours | Text-to-video ([1]) | TAV ([2]) |
|---|---|---|---|
| Gen1 ([3]) | PnP per frame ([4]) | Fate-Zero ([8]) | Re-render a Video ([10]) |
| "an origami of a stork" | Ours | Text-to-video ([1]) | TAV ([2]) |
|---|---|---|---|
| Gen1 ([3]) | PnP per frame ([4]) | Fate-Zero ([8]) | Re-render a Video ([10]) |
| "a metal sculpture" | Ours | Text-to-video ([1]) | TAV ([2]) |
|---|---|---|---|
| Gen1 ([3]) | PnP per frame ([4]) | Fate-Zero ([8]) | Re-render a Video ([10]) |
| "a fluffy wolf doll" | Ours | Text-to-video ([1]) | TAV ([2]) |
|---|---|---|---|
| Gen1 ([3]) | PnP per frame ([4]) | Fate-Zero ([8]) | Re-render a Video ([10]) |
We present additional qualitative comparisons of our method with Text2LIVE ([5]) and Ebsynth ([6]).
Text2LIVE lacks a strong generative prior and thus has poor visual quality. Ebsynth performs well on frames close to the edited keyframe, but either fails to propagate the edit to the rest of the video or introduces artifacts.
| "a car in s snowy scene" | Ours | Text2LIVE ([5]) | Ebsynth ([6]) |
|---|---|---|---|
We ablate TokenFlow propagation and keyframe randomization.
| "a colorful polygonal illustration" | Ours | Ours, constant keyframes | Extended attention, random keyframes |
|---|---|---|---|
| "a rainbow textured dog" | |||
We present a PCA visualization of the original video features, the features of a video edited by our method, and the features of frames edited per-frame. Different rows show features from different layers of the UNet decoder.
| original video | Ours | Per frame editing |
|---|---|---|
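As a rough illustration of how such a feature-PCA visualization can be produced (a minimal sketch under assumed feature shapes, not our exact plotting pipeline), the activations of a chosen UNet decoder layer can be projected onto three shared principal components and rendered as RGB images:

```python
# Minimal sketch of a feature-PCA visualization under assumed shapes; extracting
# the decoder activations and choosing the layers are not shown here.
import numpy as np
from sklearn.decomposition import PCA
from PIL import Image

def pca_visualize(features: np.ndarray, out_prefix: str) -> None:
    """features: (num_frames, C, H, W) activations from one UNet decoder layer."""
    f, c, h, w = features.shape
    # Fit a single PCA over all frames so the 3 components (mapped to RGB) are shared across time.
    flat = features.transpose(0, 2, 3, 1).reshape(-1, c)             # (f*h*w, c)
    comps = PCA(n_components=3).fit_transform(flat)                   # (f*h*w, 3)
    comps = (comps - comps.min(0)) / (np.ptp(comps, axis=0) + 1e-8)   # scale to [0, 1]
    comps = comps.reshape(f, h, w, 3)
    for i, frame in enumerate(comps):
        Image.fromarray((frame * 255).astype(np.uint8)).save(f"{out_prefix}_{i:03d}.png")
```

Fitting one PCA jointly over all frames keeps the color coding consistent across time, so corresponding regions can be compared between frames.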
[1] Levon Khachatryan, Andranik Movsisyan, Vahram Tadevosyan, Roberto Henschel, Zhangyang Wang, Shant Navasardyan, and Humphrey Shi. Text2video-zero: Text-to-image diffusion models are zero-shot video generators. arXiv preprint arXiv:2303.13439, 2023.
[2] Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Stan Weixian Lei, Yuchao Gu, Wynne Hsu, Ying Shan, Xiaohu Qie, and Mike Zheng Shou. Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation. arXiv preprint arXiv:2212.11565, 2022.
[3] Patrick Esser, Johnathan Chiu, Parmida Atighehchian, Jonathan Granskog, and Anastasis Germanidis. Structure and content-guided video synthesis with diffusion models. arXiv preprint arXiv:2302.03011, 2023.
[4] Narek Tumanyan, Michal Geyer, Shai Bagon, and Tali Dekel. Plug-and-play diffusion features for text-driven image-to-image translation. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
[5] Omer Bar-Tal, Dolev Ofri-Amar, Rafail Fridman, Yoni Kasten, and Tali Dekel. Text2live: Text-driven layered image and video editing. In European Conference on Computer Vision. Springer, 2022.
[6] Ondřej Jamriška, Šárka Sochorová, Ondřej Texler, Michal Lukáč, Jakub Fišer, Jingwan Lu, Eli Shechtman, and Daniel Sýkora. Stylizing video by example. ACM Transactions on Graphics, 2019.
[7] Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. SDEdit: Guided image synthesis and editing with stochastic differential equations. In International Conference on Learning Representations, 2022.
[8] Chenyang Qi, Xiaodong Cun, Yong Zhang, Chenyang Lei, Xintao Wang, Ying Shan, and Qifeng Chen. Fatezero: Fusing attentions for zero-shot text-based video editing. arXiv preprint arXiv:2303.09535, 2023.
[9] Lvmin Zhang and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models, 2023.
[10] Shuai Yang, Yifan Zhou, Ziwei Liu, and Chen Change Loy. Rerender a video: Zero-shot text-guided video-to-video translation, 2023.