Pairwise Comparisons: Foreground Object Consistency


Example 1

Human:
Video 1 | Video 2

VB-SC:
Video 1 | Video 2
Score 1 : 0.9352 | Score 2 : 0.9358

Tracker-FG (Ours):
Video 1 | Video 2
Score 1 : 0.9947 | Score 2 : 0.9945

The VBench metric (VB-SC) fails to capture the fine-grained distortions near the cat's face. In addition, VB-SC tends to favour videos with less camera motion. Our Tracker-FG captures fine-grained, long-term dependencies more effectively.


Example 2

Human:
Video 1 | Video 2

VB-SC:
Video 1 | Video 2
Score 1 : 0.9442 | Score 2 : 0.9567

Tracker-FG (Ours):
Video 1 | Video 2
Score 1 : 0.9947 | Score 2 : 0.9926

The VBench metric (VB-SC) fails when there are multiple subjects in the scene, likely because it collapses every subject instance into a single feature. Tracker-FG tracks each subject individually and computes consistency per subject.
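The difference between pooling all subjects into one feature and scoring each tracked subject separately can be illustrated with a toy sketch. This is not the Tracker-FG or VB-SC implementation: the feature vectors, the helper names (`track_consistency`, `per_subject_consistency`, `pooled_consistency`), and the use of mean cosine similarity between consecutive frames are all assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def track_consistency(frames):
    """Mean cosine similarity between consecutive frames of one feature track."""
    sims = [cosine(frames[t], frames[t + 1]) for t in range(len(frames) - 1)]
    return sum(sims) / len(sims) if sims else 1.0


def per_subject_consistency(tracks):
    """Score every tracked subject on its own, then average the scores."""
    scores = [track_consistency(frames) for frames in tracks.values()]
    return sum(scores) / len(scores)


def pooled_consistency(tracks):
    """Average all subjects into one feature per frame, then score that track."""
    per_frame = []
    for frame_feats in zip(*tracks.values()):
        dim = len(frame_feats[0])
        per_frame.append(
            [sum(f[d] for f in frame_feats) / len(frame_feats) for d in range(dim)]
        )
    return track_consistency(per_frame)


# Hypothetical toy features: two subjects whose appearances swap between frames.
tracks = {
    "subject_a": [[1.0, 0.0], [0.0, 1.0]],
    "subject_b": [[0.0, 1.0], [1.0, 0.0]],
}

pooled = pooled_consistency(tracks)            # pooled feature never changes
per_subject = per_subject_consistency(tracks)  # each subject changes completely
print(f"pooled={pooled:.4f} per_subject={per_subject:.4f}")
```

In this toy case the pooled feature is identical in both frames, so the pooled score stays near 1 even though each subject's appearance changes completely, while the per-subject score drops to 0.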


Example 3

Human:
Video 1 | Video 2

VB-SC:
Video 1 | Video 2
Score 1 : 0.5502 | Score 2 : 0.9379

Tracker-FG (Ours):
Video 1 | Video 2
Score 1 : 0.9812 | Score 2 : 0.9746

The VBench metric (VB-SC) is highly sensitive to camera motion, as is evident in its score for Video 1 (0.55, compared to values close to 0.95).


Example 4

Human:
Video 1 | Video 2

VB-SC:
Video 1 | Video 2
Score 1 : 0.7120 | Score 2 : 0.8132

Tracker-FG (Ours):
Video 1 | Video 2
Score 1 : 0.9879 | Score 2 : 0.9513

The camera-motion bias of the VBench metric (VB-SC) is evident in its scores, whereas Tracker-FG focuses purely on object consistency.
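The scores listed for Examples 1 to 4 can be turned into pairwise preferences with a few lines of code. The sketch below assumes that, for both metrics, the video with the higher score is the preferred one; the helper name `preferred_video` is hypothetical. The human preference is not encoded here because it is shown visually rather than as a number.

```python
# Scores copied from Examples 1-4: (Score 1, Score 2) for each metric.
SCORES = {
    "Example 1": {"VB-SC": (0.9352, 0.9358), "Tracker-FG": (0.9947, 0.9945)},
    "Example 2": {"VB-SC": (0.9442, 0.9567), "Tracker-FG": (0.9947, 0.9926)},
    "Example 3": {"VB-SC": (0.5502, 0.9379), "Tracker-FG": (0.9812, 0.9746)},
    "Example 4": {"VB-SC": (0.7120, 0.8132), "Tracker-FG": (0.9879, 0.9513)},
}


def preferred_video(score_1, score_2):
    """Return 1 or 2 for the higher-scoring video, or 0 on an exact tie."""
    if score_1 == score_2:
        return 0
    return 1 if score_1 > score_2 else 2


# Assumption: a higher score means the metric prefers that video.
preferences = {
    example: {metric: preferred_video(*pair) for metric, pair in metrics.items()}
    for example, metrics in SCORES.items()
}

for example, prefs in preferences.items():
    print(example, prefs)
```

Under that assumption, VB-SC prefers Video 2 and Tracker-FG prefers Video 1 in all four examples, so the two metrics disagree on every pair.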


Cases where primary objects are not generated from the same prompt

Example 5

Human:
Video 1 | Video 2

VB-SC:
Video 1 | Video 2
Score 1 : 0.8174 | Score 2 : 0.9828

Tracker-FG (Ours):
Video 1 | Video 2

The model behind Video 2 fails to generate the primary object, so we consider this video to be low quality. Object tracking does not work for this scene, so we use object detection as the pairwise metric instead.
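One way a detection-based pairwise comparison could work is sketched below. It is only an illustration under stated assumptions: the per-frame confidences are made-up placeholder values, not outputs from the videos shown, and the 0.5 threshold and the helper names `detection_rate` and `prefer_by_detection` are hypothetical. The source does not specify the detector or the decision rule.

```python
def detection_rate(confidences, threshold=0.5):
    """Fraction of frames in which the primary object is detected."""
    if not confidences:
        return 0.0
    hits = sum(1 for c in confidences if c >= threshold)
    return hits / len(confidences)


def prefer_by_detection(conf_1, conf_2, threshold=0.5):
    """Return 1 or 2 for the video with the higher detection rate, 0 on a tie."""
    rate_1 = detection_rate(conf_1, threshold)
    rate_2 = detection_rate(conf_2, threshold)
    if rate_1 == rate_2:
        return 0
    return 1 if rate_1 > rate_2 else 2


# Placeholder per-frame detector confidences for the prompt's primary object.
video_1_conf = [0.91, 0.88, 0.93, 0.90]  # object found in every frame
video_2_conf = [0.12, 0.05, 0.30, 0.08]  # object never confidently found

choice = prefer_by_detection(video_1_conf, video_2_conf)
print("preferred video:", choice)
```

With these placeholder values the rule prefers Video 1, since the primary object is detected in all of its frames and in none of Video 2's.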