Our motion embedding functions like a positional embedding, a component found in almost all video generative models. As a result, it can be applied to many different video generative models using common techniques.
The first three slides are based on ZeroScope, and the last slide is based on AnimateDiff.
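As a rough illustration of the positional-embedding analogy, the sketch below adds a per-frame vector to per-frame features, first for a positional embedding and then for a motion embedding. The function name `add_frame_embedding`, the toy sizes, and all values are hypothetical. Simple addition is an assumption made for illustration, not a description of the actual implementation.

```python
def add_frame_embedding(features, embedding):
    """Add one embedding vector to each frame's feature vector.

    features:  list of F frames, each a list of C floats
    embedding: list of F vectors, each a list of C floats
    """
    if len(features) != len(embedding):
        raise ValueError("need exactly one embedding vector per frame")
    return [
        [x + e for x, e in zip(frame, emb)]
        for frame, emb in zip(features, embedding)
    ]


# Hypothetical toy sizes: 4 frames, 2 channels.
features = [[1.0, 1.0] for _ in range(4)]

# Stand-in for the base model's positional embedding (assumed values).
positional = [[0.1 * t, 0.0] for t in range(4)]

# Stand-in for a learned motion embedding (assumed values).
motion = [[0.0, 0.5 * t] for t in range(4)]

with_position = add_frame_embedding(features, positional)
with_motion = add_frame_embedding(with_position, motion)
```

Because both embeddings enter the same way in this sketch, any model that already consumes a per-frame positional embedding has a natural place to consume a motion embedding as well.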
Comparison with Tune-A-Video
Source Video
Tune-A-Video
Ours
A motor driving in the desert.
Source Video
Tune-A-Video
Ours
A giraffe walking in the zoo.
Embedding Ablation
Source Video
w/ Motion QK Emb, w/o Motion V Emb
w/ Motion QK & V Emb.
A dessert shot with pan right.
Source Video
w/ Motion QK Emb, w/o Motion V Emb
w/ Motion QK & V Emb.
A tiger walking in the forest.
Structure Ablation
Source Video
1D QK & 1D V
2D QK & 1D V
2D QK & 2D V
Ours - 1D QK & 2D V
A dessert shot with pan right.
Source Video
1D QK & 1D V
2D QK & 1D V
2D QK & 2D V
Ours - 1D QK & 2D V
A tiger walking in the forest.
Comparison of Loss and Optimization Target
Source Video (mosaic applied to block out sensitive information)
Motion Director with MSE loss
Motion Director with Hybrid loss
Ours with MSE loss
Ours with Hybrid loss
A firefighter standing in front of a burning forest.
Source Video
Motion Director with MSE loss
Motion Director with Hybrid loss
Ours with MSE loss
Ours with Hybrid loss
An elephant walking on the rock.
Effect of Different Numbers of DDIM Inversion Steps
Source Video
A goose walking on the field. (From left to right: 50, 45, 40, 35, and 30 DDIM inversion steps, respectively.)
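To make the step counts concrete, the sketch below shows one common way a DDIM step count maps to a subset of the training timesteps. It assumes 1000 training timesteps and evenly strided selection. Both are assumptions, as is the helper name `inversion_timesteps`. The schedule actually used for these results is not stated here.

```python
def inversion_timesteps(num_steps, num_train_timesteps=1000):
    """Return evenly strided timesteps in ascending order.

    Inversion runs from low noise to high noise, so the list ascends.
    The stride rule (integer division) is one common convention and is
    assumed here for illustration.
    """
    if not 0 < num_steps <= num_train_timesteps:
        raise ValueError("num_steps must be in 1..num_train_timesteps")
    stride = num_train_timesteps // num_steps
    return [i * stride for i in range(num_steps)]


for n in (50, 45, 40, 35, 30):
    ts = inversion_timesteps(n)
    # Fewer steps means a larger stride between visited timesteps.
    print(f"steps={n:2d}  stride={ts[1] - ts[0]:2d}  last_timestep={ts[-1]}")
```

Fewer inversion steps mean a coarser walk through the noise schedule, which is the trade-off this comparison explores.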
Effect of Positional Embedding of Video Generation Models
Generated 16-frame video (high motion intensity)
Generated 24-frame video without extrapolating positional embeddings
Generated 24-frame video with extrapolated positional embeddings (low motion intensity)
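As a minimal sketch of what extending positional embeddings beyond the original frame count can look like, the code below evaluates the standard sinusoidal formula at 24 positions instead of 16. It assumes a sinusoidal embedding. The models shown here may use a different scheme (for example a learned table), and the function name and the dimension of 8 are hypothetical.

```python
import math


def sinusoidal_embedding(num_positions, dim):
    """Standard sinusoidal positional embedding, one row per position.

    Even channels hold sin(pos / 10000^(i/dim)) and the following odd
    channel holds the matching cos, where i is the even channel index.
    """
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(0, dim, 2):
            angle = pos / (10000 ** (i / dim))
            row.extend([math.sin(angle), math.cos(angle)])
        table.append(row)
    return table


short = sinusoidal_embedding(16, 8)  # positions seen for 16-frame generation
long = sinusoidal_embedding(24, 8)   # same formula evaluated at 24 positions

# The first 16 rows are unchanged; rows 16..23 are the extrapolated positions.
unchanged = long[:16] == short
```

With a formula-based embedding, the extra frames only require evaluating positions 16 through 23. A learned table has no such formula and would need a different extension strategy.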