Abstract: Weakly-supervised Temporal Action Localization (WTAL) following a localization-by-classification paradigm has achieved significant results, yet it still grapples with confounding arising from ambiguous snippets. Previous works have attempted to distinguish these ambiguous snippets from action snippets without investigating the underlying causes of their formation, and thus fail to eliminate the biases on both action-context and action-content. In this paper, we revisit WTAL from the perspective of a structural causal model to identify the true origins of confounding, and propose an efficient dual-confounder elimination framework to alleviate both biases. Specifically, we construct a Substituted Confounder Set (SCS) to eliminate the confounding bias on action-context by leveraging the modal disparity between RGB and optical flow. Then, a Multi-level Consistency Mining (MCM) method is designed to mitigate the confounding bias on action-content by exploiting the consistency between discriminative snippets and their corresponding proposals at both the feature and label levels. Notably, SCS and MCM can be seamlessly integrated into any two-stream model without additional parameters via the Expectation-Maximization (EM) algorithm. Extensive experiments on two challenging benchmarks, THUMOS14 and ActivityNet-1.2, demonstrate the superior performance of our method.
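To illustrate the modal-disparity criterion that motivates SCS, the following is a minimal sketch, assuming per-snippet class-agnostic attention scores are available for each modality; the function name, inputs, and threshold `tau` are hypothetical illustrations, not the paper's actual implementation.

```python
# Hypothetical sketch of the SCS idea: snippets whose RGB and optical-flow
# attention disagree strongly are treated as candidate (context) confounders.
# All names and the threshold are illustrative, not the authors' code.
import numpy as np

def build_substituted_confounder_set(rgb_attn: np.ndarray,
                                     flow_attn: np.ndarray,
                                     tau: float = 0.5) -> np.ndarray:
    """Return indices of snippets whose modal disparity exceeds tau.

    rgb_attn, flow_attn: shape (T,), class-agnostic attention scores in
    [0, 1] for the T snippets of a video, one array per modality.
    """
    disparity = np.abs(rgb_attn - flow_attn)   # per-snippet modal disagreement
    return np.flatnonzero(disparity > tau)     # ambiguous snippets -> SCS

# Toy usage: snippet 2 has strong appearance but weak motion evidence.
rgb = np.array([0.9, 0.8, 0.9, 0.1])
flow = np.array([0.8, 0.7, 0.2, 0.1])
print(build_substituted_confounder_set(rgb, flow))  # -> [2]
```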
Primary Subject Area: [Content] Vision and Language
Secondary Subject Area: [Engagement] Summarization, Analytics, and Storytelling, [Experience] Multimedia Applications
Relevance To Conference: This work advances multimedia/multimodal processing by addressing confounding in Weakly-supervised Temporal Action Localization (WTAL). It introduces an efficient dual-confounder elimination framework that leverages the disparity between RGB and optical-flow modalities together with multi-level consistency mining to improve the accuracy of action localization in videos. This offers deeper insight into confounding biases and improves the robustness of WTAL algorithms. By integrating visual appearance and motion information, the approach enriches multimodal processing and leads to more accurate temporal action localization.
Supplementary Material: zip
Submission Number: 4788