Boundary-Aware Temporal Dynamic Pseudo-Supervision Pairs Generation for Zero-Shot Natural Language Video Localization

Published: 01 Jan 2025, Last Modified: 20 Jul 2025, Venue: AAAI 2025, License: CC BY-SA 4.0
Abstract: Zero-shot Natural Language Video Localization (NLVL) aims to automatically generate moments and corresponding pseudo queries from raw videos, so that a localization model can be trained without any manual annotations. Existing approaches typically produce pseudo queries as simple words, which overlooks the complexity of queries in real-world scenarios. Given the powerful text modeling capabilities of large language models (LLMs), leveraging LLMs to generate complete queries that are closer to human descriptions is a potential solution. However, directly integrating LLMs into existing approaches introduces several issues, including insensitivity, isolation, and lack of regulation, which prevent the full exploitation of LLMs to enhance zero-shot NLVL performance. To address these issues, we propose BTDP, an innovative framework for Boundary-aware Temporal Dynamic Pseudo-supervision pairs generation. Our method contains two crucial operations: 1) Boundary Segmentation, which identifies both visual and semantic boundaries to generate atomic segments and activity descriptions, tackling the issue of insensitivity; and 2) Context Aggregation, which employs LLMs with a self-evaluation process to aggregate and summarize global video information for optimized pseudo moment-query pairs, tackling the issues of isolation and lack of regulation. Comprehensive experimental results on the Charades-STA and ActivityNet Captions datasets demonstrate the effectiveness of our BTDP method.