Motion Inversion for Video Customization: Supplementary Material

Source Video (Orbit shot)

"A rabbit, low poly game art style."

Source Video (Crane up shot)

"An island by the sea."

Source Video

"A robot is dancing."

Source Video

"Monkeys are playing coconut."

More results

Source Video (Orbit shot)

"A house, 3d style."

Source Video

"Skeleton in suit is dancing, in autumn."

Source Video (Crane up shot)

"Ice on the sea in sunset."

Source Video

"A tiger doing pull-ups in the forest."

Source Video (Custom shot)

"A high-tech chip."

Source Video

"A dragon sitting in a flora garden."

Motion Embeddings for ZeroScope and AnimateDiff

Our motion embeddings function like positional embeddings, which are present in almost all video generative models. As a result, they can be applied to many different video generative models using common techniques.
The first three slides are based on ZeroScope, and the last slide is based on AnimateDiff.
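To make the positional-embedding analogy concrete, here is a minimal sketch (not the authors' implementation) of temporal self-attention in which learned embeddings are added to the tokens before attention. It assumes identity query/key/value projections, and it reads "1D" as one vector per frame shared by all spatial positions and "2D" as one vector per frame and spatial position. The function name `temporal_attention` and all shapes are hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def add(u, v):
    return [a + b for a, b in zip(u, v)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def temporal_attention(x, qk_emb, v_emb):
    """Attention across frames, run independently at each spatial position.

    x:      [P][F][C] features (P spatial positions, F frames, C channels)
    qk_emb: [F][C]    "1D" embedding, shared by all spatial positions (assumed)
    v_emb:  [P][F][C] "2D" embedding, one vector per position and frame (assumed)
    Identity projections are used for brevity; a real model would apply
    learned linear maps to obtain queries, keys, and values.
    """
    out = []
    for p, tokens in enumerate(x):
        # The same embedding feeds queries and keys, like a positional embedding.
        qk = [add(t, e) for t, e in zip(tokens, qk_emb)]
        # A separate embedding is added on the value path.
        v = [add(t, e) for t, e in zip(tokens, v_emb[p])]
        scale = 1.0 / math.sqrt(len(tokens[0]))
        frames = []
        for q in qk:
            w = softmax([dot(q, k) * scale for k in qk])
            frames.append([sum(wj * vj[c] for wj, vj in zip(w, v))
                           for c in range(len(q))])
        out.append(frames)
    return out
```

Because the embeddings are only added to the attention inputs, swapping in a different pair of tables changes the temporal behavior without touching the model weights, which is what makes the approach portable across models that already use positional embeddings.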


Comparison with Tune-A-Video

Source Video

Tune-A-Video

Ours

A motor driving in the desert.

Source Video

Tune-A-Video

Ours

A giraffe walking in the zoo.

Embedding Ablation

Source Video

w/ Motion QK Emb, w/o Motion V Emb

w/ Motion QK & V Emb.

A dessert shot with pan right.

Source Video

w/ Motion QK Emb, w/o Motion V Emb

w/ Motion QK & V Emb.

A tiger walking in the forest.

Structure Ablation

Source Video

1D QK & 1D V

2D QK & 1D V

2D QK & 2D V

Ours - 1D QK & 2D V

A dessert shot with pan right.

Source Video

1D QK & 1D V

2D QK & 1D V

2D QK & 2D V

Ours - 1D QK & 2D V

A tiger walking in the forest.

Comparison of Loss and Optimization Target

Source Video (mosaic applied to obscure sensitive information)

Motion Director with MSE loss

Motion Director with Hybrid loss

Ours with MSE loss

Ours with Hybrid loss

A firefighter standing in front of a burning forest.

Source Video

Motion Director with MSE loss

Motion Director with Hybrid loss

Ours with MSE loss

Ours with Hybrid loss

An elephant walking on the rock.

Effect of Different Numbers of DDIM Inversion Steps

Source Video

A goose walking on the field. (From left to right: 50, 45, 40, 35, and 30 DDIM inversion steps, respectively.)
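For readers unfamiliar with what the step count controls, the following is a minimal sketch of deterministic DDIM inversion on a scalar "latent". It assumes a linear beta schedule with 1000 training timesteps and evenly spaced inversion timesteps; `eps_model` is a hypothetical stand-in for the noise-prediction network. None of these settings are taken from the source.

```python
import math

def alpha_bar_schedule(T=1000, beta_start=1e-4, beta_end=2e-2):
    """Cumulative product of (1 - beta_t) for a linear beta schedule (assumed)."""
    out, prod = [], 1.0
    for i in range(T):
        beta = beta_start + (beta_end - beta_start) * i / (T - 1)
        prod *= 1.0 - beta
        out.append(prod)
    return out

def ddim_timesteps(num_steps, T=1000):
    """Evenly spaced timesteps in ascending order (clean -> noisy)."""
    stride = T // num_steps
    return [i * stride for i in range(num_steps)]

def ddim_invert(x0, eps_model, num_steps, T=1000):
    """Map a clean sample to a noisy one with the deterministic DDIM update."""
    abar = alpha_bar_schedule(T)
    ts = ddim_timesteps(num_steps, T)
    x = x0
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        eps = eps_model(x, t_cur)
        # Predict the clean sample implied by the current noise estimate.
        x0_pred = (x - math.sqrt(1 - abar[t_cur]) * eps) / math.sqrt(abar[t_cur])
        # Re-noise that prediction to the next (noisier) timestep.
        x = math.sqrt(abar[t_next]) * x0_pred + math.sqrt(1 - abar[t_next]) * eps
    return x
```

Fewer steps mean a larger stride between consecutive timesteps, so each update relies on a coarser approximation of the noise trajectory.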

Effect of Positional Embedding of Video Generation Models

Generated 16-frame video (high motion intensity)
Generated 24-frame video without extrapolating positional embeddings
Generated 24-frame video with extrapolated positional embeddings (low motion intensity)
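The source does not say how the positional embeddings are extended from 16 to 24 frames. As one plausible illustration only, the sketch below linearly resamples a learned per-frame table along the frame axis, a common way to stretch a fixed-length table. The function name `resample_frame_embeddings` is hypothetical, and this resampling is a swapped-in technique rather than the authors' stated method.

```python
def resample_frame_embeddings(table, new_len):
    """Linearly resample a [F][C] embedding table to [new_len][C].

    The first and last rows are preserved; rows in between are
    interpolated from their two nearest original neighbors.
    """
    old_len = len(table)
    if new_len == 1 or old_len == 1:
        return [list(table[0]) for _ in range(new_len)]
    out = []
    for i in range(new_len):
        # Fractional position of new frame i on the original frame axis.
        pos = i * (old_len - 1) / (new_len - 1)
        lo = int(pos)
        hi = min(lo + 1, old_len - 1)
        w = pos - lo
        out.append([(1 - w) * a + w * b for a, b in zip(table[lo], table[hi])])
    return out

# Toy table whose single channel equals the frame index.
table_16 = [[float(f)] for f in range(16)]
table_24 = resample_frame_embeddings(table_16, 24)
```

Under this scheme the per-frame change in the embedding shrinks (here from 1 to 15/23 per frame), since the same range is spread over more frames. That would be consistent with the lower motion intensity noted above, though the source does not state the mechanism.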