Abstract: Temporal Sentence Grounding (TSG), which aims to localize events in untrimmed videos given a language query, has been widely studied over the past decade. However, researchers have recently demonstrated that previous approaches are severely limited in out-of-distribution generalization, motivating the De-biased TSG (DTSG) challenge, which requires models to overcome their weakness on outlier test samples. In this paper, we design a novel framework, termed Counterfactually-Augmented Event Matching (CAEM), which incorporates counterfactual data augmentation to learn event-query joint representations that resist training bias. Specifically, it consists of three components: (1) a Temporal Counterfactual Augmentation module that generates counterfactual video-text pairs by temporally delaying events in the untrimmed video, enhancing the model's capacity for counterfactual thinking; (2) an Event-Query Matching model that learns joint representations and predicts a matching score for each event candidate; (3) a Counterfact-Adaptive Framework (CAF) that imposes counterfactual consistency rules on the matching of the same event-query pairs, further mitigating the bias learned from training sets. We conduct thorough experiments on two widely used DTSG datasets, Charades-CD and ActivityNet-CD, to evaluate the proposed CAEM method. Extensive experimental results show that CAEM outperforms recent state-of-the-art methods on both datasets. Our implementation code is available at https://github.com/CFM-MSG/CAEM_Code.
Primary Subject Area: [Experience] Multimedia Applications
Secondary Subject Area: [Engagement] Summarization, Analytics, and Storytelling
Relevance To Conference: The Temporal Sentence Grounding (TSG) task aims to localize events in long untrimmed videos according to given text queries. As a typical multimodal video understanding task, TSG has attracted wide attention from both academic research and industrial applications. However, recent studies have demonstrated that previous TSG methods suffer from severe defects in generalization towards outlier test samples, motivating the De-biased TSG (DTSG) task. In this work we propose a novel DTSG method that aims to improve the generalization ability of TSG models on out-of-distribution video-text pairs. All of our contributions, e.g., learning generalizable event-query joint representations for TSG, pertain to multimodal processing and fall well within the scope of the ACM Multimedia conference.
Supplementary Material: zip
Submission Number: 1937