Abstract: Video Corpus Moment Retrieval aims to select a temporal video moment pertinent to a given language query from a large video corpus. Existing systems are prone to exploit retrieval bias as a shortcut, which hinders them from accurately learning vision-language associations. This retrieval bias consists of spurious correlations between queries and scenes: for a given query, systems tend to retrieve spuriously correlated scenes because biased annotations make certain query-scene pairings dominate the dataset. To this end, we present Counterfactual Two-stage Debiasing Learning (CTDL), which incorporates a counterfactual bias network that intentionally learns the retrieval bias by being given a shortcut to the spurious keyword-scene correlation, and performs two-stage debiasing learning that mitigates the bias by contrasting factual retrievals with counterfactually biased retrievals. Extensive experiments show the effectiveness of the CTDL paradigm.
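The abstract describes two components: a bias branch that deliberately fits the keyword-scene shortcut, and a debiasing objective that contrasts factual retrieval scores with the counterfactually biased ones. Below is a minimal, hypothetical PyTorch sketch of that idea; all module names, projections, and the specific loss form are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FactualRetrieval(nn.Module):
    """Main branch: scores a video moment against the full language query."""
    def __init__(self, dim=256):
        super().__init__()
        self.query_proj = nn.Linear(dim, dim)
        self.video_proj = nn.Linear(dim, dim)

    def forward(self, query_feat, video_feat):
        q = F.normalize(self.query_proj(query_feat), dim=-1)
        v = F.normalize(self.video_proj(video_feat), dim=-1)
        return (q * v).sum(-1)  # cosine-similarity retrieval score


class CounterfactualBiasBranch(nn.Module):
    """Bias branch (assumed form): deliberately learns the keyword->scene
    shortcut by scoring the scene from keyword features alone."""
    def __init__(self, dim=256):
        super().__init__()
        self.keyword_proj = nn.Linear(dim, dim)
        self.scene_proj = nn.Linear(dim, dim)

    def forward(self, keyword_feat, scene_feat):
        k = F.normalize(self.keyword_proj(keyword_feat), dim=-1)
        s = F.normalize(self.scene_proj(scene_feat), dim=-1)
        return (k * s).sum(-1)


def two_stage_debiasing_loss(factual_score, biased_score, labels, margin=0.2):
    """Hypothetical objective: (1) let the bias branch fit the shortcut,
    (2) contrast factual vs. counterfactually biased retrievals so the main
    branch cannot settle on the bias-only solution."""
    bias_loss = F.binary_cross_entropy_with_logits(biased_score, labels)
    contrast = F.relu(margin - (factual_score - biased_score.detach()))
    debias_loss = (labels * contrast).mean()
    return bias_loss + debias_loss
```

The `detach()` on the biased score reflects the intuition that gradients from the contrastive term should reshape only the factual branch, leaving the bias branch free to keep capturing the spurious correlation; this is one plausible reading of the two-stage design, not a confirmed detail of CTDL.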