Videos for Ablation Studies
On this page, we present videos from different stages of our framework, illustrating its progressive improvements. Stages 1 and 2 effectively capture motion representations but lack fine-grained visual details. Stage 3 refines these details; however, hand artifacts remain. With the addition of hand refinement, our complete method produces visually coherent, artifact-free videos. Moreover, we observe that as hand quality improves, overall visual quality, including facial detail, also improves. This suggests that hand realism is closely linked to the overall perceptual quality of generated videos.
Additionally, we present results from Stage 2 without the invertible feature extractor (IFE), demonstrating its necessity in our framework. We also provide results where generation relies solely on audio, without motion information, which highlights the limitations of audio-only video synthesis. Together, these experiments underscore the importance of motion representation learning in our method.
Stage-wise comparison videos (four examples). Panels, in order: Stage 1, Stage 2, Stage 2 w/o IFE, Stage 3, Ours.
Motion ablation videos (three examples). Panels, in order: W/o Motion Information, Ours.