Keywords: Video Temporal Grounding; Moment Retrieval; Highlight Detection; Cross-modal Alignment; Feature Aggregation
Abstract: Recent approaches to Video Temporal Grounding (VTG) predominantly rely on CLIP-based representations, often augmented with visual encoders such as SlowFast or C3D to enhance temporal modeling. However, the prevailing “concat-then-project” paradigm disrupts the inherent alignment between CLIP’s visual and textual modalities and undermines the temporal modeling capabilities of the additional video encoder. To address these issues, we propose FDAP, a plug-and-play Feature Decoupling and Aggregation Paradigm. FDAP introduces two key components: a Textual-Guided Feature Decoupling Module (TGFDM) that preserves CLIP’s cross-modal alignment and SlowFast’s temporal modeling via independent attention maps, and a Dual-branch Feature Aggregation Module (DFAM) that dynamically reweights the two feature streams during aggregation according to the query. Extensive experiments across four VTG methods (M-DETR, TR-DETR, CG-DETR, Flash-VTG) on three benchmark datasets (QVHighlights, Charades-STA, TACoS) demonstrate consistent performance gains, \emph{e.g.}, a 3\% improvement in M-DETR’s R1@0.7 metric. With minimal overhead (0.2M additional parameters), FDAP advances VTG feature modeling and generalizes effectively across diverse methods.
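A minimal PyTorch sketch of the query-conditioned dual-branch aggregation idea the abstract attributes to DFAM, based only on the description above: branch weights are predicted from the pooled text query and used to mix CLIP and SlowFast clip features. The class name, dimensions, pooling, and gating scheme are illustrative assumptions, not the paper’s actual implementation.

```python
import torch
import torch.nn as nn


class DualBranchAggregation(nn.Module):
    """Hypothetical DFAM-style aggregation: fuse CLIP and SlowFast clip
    features with per-branch weights conditioned on the text query.
    All shapes and module choices are assumptions for illustration."""

    def __init__(self, d_model: int = 256):
        super().__init__()
        # Maps the pooled query embedding to two branch-mixing logits.
        self.gate = nn.Linear(d_model, 2)

    def forward(self, clip_feats, sf_feats, query_feats):
        # clip_feats, sf_feats: (B, T, D) clip-level features per branch
        # query_feats: (B, L, D) token-level text features
        q = query_feats.mean(dim=1)               # (B, D) pooled query
        w = torch.softmax(self.gate(q), dim=-1)   # (B, 2) branch weights
        w = w.unsqueeze(1).unsqueeze(-1)          # (B, 1, 2, 1) for broadcast
        stacked = torch.stack([clip_feats, sf_feats], dim=2)  # (B, T, 2, D)
        return (w * stacked).sum(dim=2)           # (B, T, D) fused features


# Usage sketch with assumed sizes (e.g., 75 clips per video, 256-d features):
agg = DualBranchAggregation(d_model=256)
clip_f = torch.randn(4, 75, 256)
sf_f = torch.randn(4, 75, 256)
query = torch.randn(4, 20, 256)
fused = agg(clip_f, sf_f, query)                  # -> (4, 75, 256)
```

A gate of this size adds only a few hundred parameters, consistent with the abstract’s claim that the paradigm carries minimal overhead, though the paper’s reported 0.2M figure presumably covers both modules.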
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 8228