Human:
Video 1 ✔ | Video 2 ✘
VB-SC:
Video 1 ✘ | Video 2 ✔
Score 1 : 0.9352 | Score 2 : 0.9358
Tracker-FG (Ours):
Video 1 ✔ | Video 2 ✘
Score 1 : 0.9947 | Score 2 : 0.9945
The VBench metric (VB-SC) fails to capture the fine-grained distortions near the cat's face. In addition, VB-SC tends to favour videos with less camera motion. Our Tracker-FG captures fine-grained, long-term dependencies more effectively.
Human:
Video 1 ✔ | Video 2 ✘
VB-SC:
Video 1 ✘ | Video 2 ✔
Score 1 : 0.9442 | Score 2 : 0.9567
Tracker-FG (Ours):
Video 1 ✔ | Video 2 ✘
Score 1 : 0.9947 | Score 2 : 0.9926
The VBench metric (VB-SC) fails when there are multiple subjects in the scene, likely because it collapses every subject instance into a single feature. Tracker-FG tracks each subject individually and computes its consistency.
Human:
Video 1 ✔ | Video 2 ✘
VB-SC:
Video 1 ✘ | Video 2 ✔
Score 1 : 0.5502 | Score 2 : 0.9379
Tracker-FG (Ours):
Video 1 ✔ | Video 2 ✘
Score 1 : 0.9812 | Score 2 : 0.9746
The VBench metric (VB-SC) is highly sensitive to camera motion, as is evident in its score for Video 1 (0.55, compared with values close to 0.95).
Human:
Video 1 ✔ | Video 2 ✘
VB-SC:
Video 1 ✘ | Video 2 ✔
Score 1 : 0.7120 | Score 2 : 0.8132
Tracker-FG (Ours):
Video 1 ✔ | Video 2 ✘
Score 1 : 0.9879 | Score 2 : 0.9513
The bias that camera motion introduces into the VBench metric (VB-SC) is evident in its scores, whereas Tracker-FG focuses purely on object consistency.
Human:
Video 1 ✔ | Video 2 ✘
VB-SC:
Video 1 ✘ | Video 2 ✔
Score 1 : 0.8174 | Score 2 : 0.9828
Tracker-FG (Ours):
Video 1 ✔ | Video 2 ✘
The model in Video 2 fails to generate the primary object, so we consider this video to be low quality. Because object detection fails on this scene, we use the object detection result itself as the pairwise metric.
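The pairwise comparisons above can be read as follows: a metric "prefers" the video it scores higher, and it agrees with the human rater when that preference matches the human choice. The snippet below is a minimal sketch of that reading, using only the scores listed in the four examples that report both metrics. The function names and the tie-breaking rule are illustrative assumptions, not part of the original evaluation code.

```python
# Scores copied from the four examples above that report both metrics.
# Each entry: (human_choice, (VB-SC score 1, score 2), (Tracker-FG score 1, score 2)).
EXAMPLES = [
    (1, (0.9352, 0.9358), (0.9947, 0.9945)),
    (1, (0.9442, 0.9567), (0.9947, 0.9926)),
    (1, (0.5502, 0.9379), (0.9812, 0.9746)),
    (1, (0.7120, 0.8132), (0.9879, 0.9513)),
]


def preferred_video(scores):
    """Return 1 or 2 for the higher-scored video.

    Ties go to Video 1; this tie rule is an assumption made for the sketch.
    """
    score_1, score_2 = scores
    return 1 if score_1 >= score_2 else 2


def agreement(examples, metric_index):
    """Fraction of pairs where the metric's preference matches the human choice."""
    matches = sum(
        1 for example in examples
        if preferred_video(example[metric_index]) == example[0]
    )
    return matches / len(examples)


vb_sc_agreement = agreement(EXAMPLES, 1)
tracker_fg_agreement = agreement(EXAMPLES, 2)
print(f"VB-SC agreement with human:      {vb_sc_agreement:.2f}")
print(f"Tracker-FG agreement with human: {tracker_fg_agreement:.2f}")
```

These four pairs are hand-picked failure cases for VB-SC, so the resulting agreement values illustrate the comparison procedure and are not representative of overall metric accuracy.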