Alignment-Enhanced Network for Temporal Language Grounding in Videos

Published: 01 Jan 2024, Last Modified: 08 Apr 2025. ICANN (3) 2024. License: CC BY-SA 4.0
Abstract: Temporal language grounding in videos aims to localize one video segment in an untrimmed video based on a given sentence query. The main challenge in this task lies in aligning the video and textual modalities effectively. Most existing methods perform only a single interaction at an early stage and overlook the information gap between the video and textual modalities, making it difficult to finely align their representations. In this paper, we propose an efficient network, namely the Alignment-Enhanced Network (AENet). It consists of a backbone that employs a Multi-step Gradual Fusion mechanism (MGFNet) and a framework that employs a Semantic Association Distillation strategy (SADF). Specifically, MGFNet begins with a coarse co-attention mechanism to capture global information at the early stages, followed by a series of co-attention Transformer encoder layers to mine fine-grained cues. SADF employs knowledge distillation, featuring a teacher model enhanced by additional relevant queries and a student model that, given a single query, learns from the teacher through a distillation loss. By integrating the MGFNet backbone and the SADF framework, AENet achieves improved cross-modal alignment. Extensive experiments on the TACoS and Charades-STA datasets demonstrate that our solution outperforms existing state-of-the-art methods.
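To make the two ideas in the abstract concrete, the following is a minimal PyTorch sketch of (a) gradual fusion via a coarse co-attention step followed by stacked co-attention encoder layers, and (b) teacher-student distillation where the teacher sees additional related queries while the student sees a single query. All names (CoarseCoAttention, GradualFusionBackbone, distillation_loss, the prediction head, dimensions, and the choice of KL divergence) are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only; module names and hyperparameters are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CoarseCoAttention(nn.Module):
    """Early-stage coarse interaction: each modality attends to the other
    to capture global cross-modal context (assumed design)."""

    def __init__(self, dim):
        super().__init__()
        self.v2q = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.q2v = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, video, query):
        video_ctx, _ = self.v2q(video, query, query)   # video attends to query
        query_ctx, _ = self.q2v(query, video, video)   # query attends to video
        return video + video_ctx, query + query_ctx


class CoAttentionEncoderLayer(nn.Module):
    """One fine-grained fusion step: cross-attention into the video stream,
    then a standard Transformer encoder layer."""

    def __init__(self, dim):
        super().__init__()
        self.cross = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.encoder = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)

    def forward(self, video, query):
        fused, _ = self.cross(video, query, query)
        return self.encoder(video + fused)


class GradualFusionBackbone(nn.Module):
    """Coarse co-attention first, then stacked co-attention encoder layers
    that progressively refine the cross-modal representation."""

    def __init__(self, dim=256, num_layers=3):
        super().__init__()
        self.coarse = CoarseCoAttention(dim)
        self.layers = nn.ModuleList(CoAttentionEncoderLayer(dim) for _ in range(num_layers))
        self.head = nn.Linear(dim, 2)  # per-clip start/end scores (assumed head)

    def forward(self, video, query):
        video, query = self.coarse(video, query)
        for layer in self.layers:
            video = layer(video, query)
        return self.head(video)


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between softened teacher and student span predictions
    over the clip dimension (one possible choice of distillation loss)."""
    t = F.log_softmax(teacher_logits / temperature, dim=1)
    s = F.log_softmax(student_logits / temperature, dim=1)
    return F.kl_div(s, t, log_target=True, reduction="batchmean")


if __name__ == "__main__":
    video = torch.randn(2, 64, 256)           # (batch, clips, dim)
    single_query = torch.randn(2, 12, 256)    # student sees one query
    extra_queries = torch.randn(2, 36, 256)   # teacher also sees related queries

    teacher, student = GradualFusionBackbone(), GradualFusionBackbone()
    with torch.no_grad():
        teacher_out = teacher(video, torch.cat([single_query, extra_queries], dim=1))
    student_out = student(video, single_query)
    loss = distillation_loss(student_out, teacher_out)
    print(loss.item())
```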