Failure Cases: MS-Debias


Both Baseline and Our Metric Fail

Example 1

Human:
Video 1 | Video 2

VB-BG:
Video 1 | Video 2
Score 1 : 0.3532 | Score 2 : 0.7882

MS-Debias (Ours):
Video 1 | Video 2
Score 1 : 0.4205 | Score 2 : 0.7999

Our metric does not yet capture very gradual variations in the background (look closely at the details of the tree house).


Example 2

Human:
Video 1 | Video 2

VB-BG:
Video 1 | Video 2
Score 1 : 0.1760 | Score 2 : 0.3212

MS-Debias (Ours):
Video 1 | Video 2
Score 1 : 0.1988 | Score 2 : 0.5705

This case is similar to Example 1: the background varies very gradually, and our metric does not capture it.


Example 3

Human:
Video 1 | Video 2

VB-BG:
Video 1 | Video 2
Score 1 : 0.1361 | Score 2 : 0.3500

MS-Debias (Ours):
Video 1 | Video 2
Score 1 : 0.0683 | Score 2 : 0.6574

Because our metric measures consistency only, it has limited awareness of whether the content is realistic. Judging Video 2 correctly requires a deeper understanding of the scene context.


Our Metric Fails but Baseline Works

Example 4

Human:
Video 1 | Video 2

VB-BG:
Video 1 | Video 2
Score 1 : 0.6904 | Score 2 : 0.6857

MS-Debias (Ours):
Video 1 | Video 2
Score 1 : 0.4186 | Score 2 : 0.5092

VB-BG treats deformable regions such as water as consistent, whereas MS-Debias is overly sensitive to fine details within those regions.
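
The grouping of the four examples can be reproduced from the listed scores alone. The following is a minimal sketch rather than the authors' evaluation code. It rests on two assumptions that the text above does not state explicitly: a higher score means the metric prefers that video, and the human annotators preferred Video 1 in every example (inferred from the section headings and the score ordering). The names `EXAMPLES`, `HUMAN_CHOICE`, `preferred_video`, and `classify` are hypothetical.

```python
# Scores copied from the examples above, as (Video 1, Video 2).
EXAMPLES = {
    1: {"VB-BG": (0.3532, 0.7882), "MS-Debias": (0.4205, 0.7999)},
    2: {"VB-BG": (0.1760, 0.3212), "MS-Debias": (0.1988, 0.5705)},
    3: {"VB-BG": (0.1361, 0.3500), "MS-Debias": (0.0683, 0.6574)},
    4: {"VB-BG": (0.6904, 0.6857), "MS-Debias": (0.4186, 0.5092)},
}

# ASSUMPTION: the human choice is not preserved in the text; Video 1 is
# inferred from the section headings and the score ordering.
HUMAN_CHOICE = {1: 1, 2: 1, 3: 1, 4: 1}


def preferred_video(scores):
    """Return 1 or 2 for the video with the higher score.

    ASSUMPTION: higher score means more consistent. A tie returns 2.
    """
    score_1, score_2 = scores
    return 1 if score_1 > score_2 else 2


def classify(example_id):
    """Label an example by which metrics agree with the human choice."""
    human = HUMAN_CHOICE[example_id]
    agrees = {
        name: preferred_video(scores) == human
        for name, scores in EXAMPLES[example_id].items()
    }
    if not agrees["VB-BG"] and not agrees["MS-Debias"]:
        return "both fail"
    if agrees["VB-BG"] and not agrees["MS-Debias"]:
        return "MS-Debias fails, VB-BG works"
    if agrees["MS-Debias"] and not agrees["VB-BG"]:
        return "VB-BG fails, MS-Debias works"
    return "both agree with human"


for example_id in sorted(EXAMPLES):
    print(f"Example {example_id}: {classify(example_id)}")
```

Under these assumptions, Examples 1 to 3 fall into the "both fail" group and Example 4 into the group where only MS-Debias fails, which matches the headings above. In Example 4 the VB-BG margin is small (0.6904 against 0.6857), so its agreement with the human choice depends on a narrow difference.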