Abstract: Video Corpus Moment Retrieval (VCMR) aims to retrieve the relevant video from a large corpus and localize the corresponding moment within the target video based on a given query. Existing methods achieve promising accuracy and efficiency through dedicated model designs. However, we argue that these methods overly exploit dataset biases rather than semantics. This inflates performance on biased datasets and implies a significant deficiency in generalizability, an important metric that existing studies have not considered. In this paper, we identify the degradation caused by a spurious dependency and design a model to mitigate it. Specifically, we generate an Out-Of-Distribution (OOD) test set from the widely used TV Retrieval dataset, revealing existing models' erroneous dependency on the temporal locations of target moments. We then use a theoretical Structural Causal Model (SCM) to trace the roots of this dependency by constructing causal paths for the models. Building on this analysis, we propose a concrete Clip Location Deconfounding Model (CLDM) that disentangles the confounded video features into a content part and a location confounder part, then produces results via causal intervention. Experiments show that CLDM significantly alleviates the impact of dataset biases and thus achieves superior generalizability compared with existing works.
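As a rough illustration of the deconfounding idea described above (not the paper's actual CLDM architecture), the sketch below splits clip features into a content branch and a location branch, then approximates backdoor adjustment by averaging predictions over a dictionary of location confounder embeddings. All module names, dimensions, and the uniform confounder prior are assumptions for the sake of a self-contained example.

```python
import torch
import torch.nn as nn

class LocationDeconfounder(nn.Module):
    """Hypothetical sketch: disentangle clip features into content and
    location parts, then approximate P(Y | do(X)) via backdoor adjustment,
    i.e. sum_z P(Y | X, z) P(z), assuming a uniform prior over a learned
    dictionary of temporal-location confounder embeddings."""

    def __init__(self, dim: int, num_locations: int):
        super().__init__()
        self.content_proj = nn.Linear(dim, dim)   # assumed content branch
        self.location_proj = nn.Linear(dim, dim)  # assumed location branch
        # One learned embedding per temporal-location bin (the confounder z)
        self.location_dict = nn.Parameter(torch.randn(num_locations, dim))
        self.scorer = nn.Linear(2 * dim, 1)

    def forward(self, clip_feats: torch.Tensor) -> torch.Tensor:
        # clip_feats: (num_clips, dim)
        content = self.content_proj(clip_feats)        # (num_clips, dim)
        num_clips = content.size(0)
        z = self.location_dict                         # (L, dim)
        # Pair every clip's content feature with every confounder value
        pairs = torch.cat(
            [content.unsqueeze(1).expand(-1, z.size(0), -1),
             z.unsqueeze(0).expand(num_clips, -1, -1)],
            dim=-1)                                    # (num_clips, L, 2*dim)
        scores = self.scorer(pairs).squeeze(-1)        # (num_clips, L)
        # Backdoor adjustment with uniform P(z): average over confounders
        return scores.mean(dim=-1)                     # (num_clips,)

# Usage: score 32 clips of 256-d features against 16 location bins
scores = LocationDeconfounder(dim=256, num_locations=16)(torch.randn(32, 256))
```

The key design point this sketch tries to convey is that the moment score is no longer conditioned on the clip's actual temporal position; instead, the position-related signal is marginalized out over all confounder values, so the model cannot shortcut through location biases in the training data.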