Abstract: Video Grounding (VG) aims to identify the moment of interest in unedited videos according to a given language query. Zero-shot methods seek to achieve grounding without labeled data, thereby offering strong generalization and broad applicability. However, existing zero-shot approaches struggle to match queries against information-rich videos, primarily because of the intricate structure and semantics of the queries. First, these methods tend to overlook the inherent temporal structure of the events described by queries, which comprises distinct phases: the initial, climax, and decay stages. By treating all phases of an event uniformly, they may inadvertently omit critical video segments. Second, queries with similar semantics can introduce ambiguity during grounding, a problem compounded by the lack of reliable verification mechanisms to assess or correct the results. To tackle these challenges, we propose a two-stage method: Exploiting Prior Tacit Knowledge to Enhance Alignment and Verification. Specifically, in the first stage, we introduce a Temporal Structure-Informed Event Proposal Generation (TSPG) module, which capitalizes on the temporal structure of events to effectively filter candidate video segments that encapsulate the critical phases of those events. In the second stage, we present a Temporal Consistency Verification and Recalibration (TCVR) module, designed to rigorously examine and refine the grounding results in accordance with the prior semantic temporal order. Extensive experiments and ablation studies on two datasets demonstrate the superiority of our method.