Self-Correcting Text-to-Video Generation with Misalignment Detection and Localized Refinement

Qualitative Results

1. Comparison with Baselines (T2V-turbo)

"1 bear and 2 people making pizza"
GIF 4
+ Ours (VideoRepair)
"A blue car parked next to a red fire hydrant on the street"
GIF 4
T2V-turbo
GIF 5
+ OPT2I
GIF 6
+ Vico
GIF 7
+ Ours (VideoRepair)

2. Comparison with Baselines (VideoCrafter2)

"A dog sitting under a umbrella on a sunny beach"
GIF 1
VideoCrafter2
GIF 2
+ OPT2I
GIF 3
+ Vico
GIF 4
+ Ours (VideoRepair)
"With the style of pointilism, A green apple and a black backpack."
GIF 1
VideoCrafter2
GIF 2
+ OPT2I
GIF 3
+ Vico
GIF 4
+ Ours (VideoRepair)

3. Iterative Refinement

"A family of four set up a tent and build a campfire, enjoying a night of camping under the stars"
GIF 1
Initial video (T2V-turbo)
GIF 2
Iter 1
GIF 4
Iter 2
"A bright yellow umbrella with a wooden handle. It's compact and easy to carry."
GIF 1
Initial video (T2V-turbo)
GIF 2
Iter 1
GIF 4
Iter 2