Abstract: Most existing solutions to temporal sentence grounding in videos (TSGV) rely heavily on local classifiers to discern start and end boundaries, often compromising internal consistency and overlooking boundary uncertainty. This paper introduces a novel global ranking approach that directly scores all candidate proposals using a dedicated loss function, thereby enhancing robustness through the integrated decoding of local and global predictions. We further incorporate pretrained language models into our framework, a largely underexplored direction in TSGV. Our methodology is evaluated across three distinct settings: distribution-consistent, distribution-changing, and compositional generalization datasets, outperforming existing baselines across the board. Notably, it exhibits superior performance on out-of-distribution and compositional generalization tasks. To the best of our knowledge, we are the first to combine global proposal ranking with pretrained language models for robust TSGV.
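As a rough illustration of the global ranking idea described above (this is a hypothetical sketch, not the authors' implementation; the function names, the use of temporal IoU to pick the positive proposal, and the hinge-style margin loss are all assumptions), every candidate (start, end) span receives a score, and the loss pushes the score of the proposal best overlapping the ground truth above all others:

```python
def iou(a, b):
    """Temporal IoU between two (start, end) spans."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union > 0 else 0.0

def global_ranking_loss(scores, proposals, gt, margin=0.2):
    """Hinge-style global ranking loss: the best-IoU proposal's score
    should exceed every other proposal's score by at least `margin`."""
    ious = [iou(p, gt) for p in proposals]
    pos = max(range(len(proposals)), key=lambda i: ious[i])
    loss = 0.0
    for i, s in enumerate(scores):
        if i != pos:
            loss += max(0.0, margin - (scores[pos] - s))
    return loss / max(1, len(scores) - 1)

# Toy example: three candidate spans, ground-truth moment (2.0, 5.0).
proposals = [(0.0, 3.0), (2.0, 5.0), (4.0, 8.0)]
scores = [0.1, 0.9, 0.3]  # model scores, one per proposal
loss = global_ranking_loss(scores, proposals, gt=(2.0, 5.0))
```

Because the loss compares each proposal against the positive one globally, rather than classifying start/end frames independently, a well-ranked score list (as in the toy example) incurs zero loss even when individual boundary frames are ambiguous.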
Paper Type: long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Publicly available software and/or pre-trained models, Data analysis
Languages Studied: English