Abstract: Language-driven action localization is a challenging task that aims to identify action boundaries, namely the start and end timestamps, within untrimmed videos using natural language queries. Previous studies have made significant progress by extensively investigating cross-modal interactions between linguistic and visual modalities. However, the computational demands imposed by untrimmed and lengthy videos remain substantial, necessitating the development of more efficient algorithms. In this paper, we propose an efficient algorithm that addresses this computational challenge by aggregating the redundant features of adjacent, semantically similar frames. Specifically, we fuse neighboring frames based on their semantic similarity to the provided language query, facilitating the identification of relevant video segments while effectively managing computational complexity. To enhance localization accuracy, we introduce a prediction adjustment module that expands the fused frames, enabling a more precise determination of the action boundaries. Moreover, our method is model-agnostic and can be easily integrated with existing methods, functioning as a plug-and-play solution. Extensive experiments on two widely-used benchmark datasets (Charades-STA and TACoS) demonstrate the effectiveness and efficiency of our method.
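To make the two steps described in the abstract concrete, the following is a minimal sketch (not the authors' implementation) of query-guided fusion of adjacent frame features and of expanding a prediction made on fused tokens back to frame indices. The function names, the averaging of merged frames, and the similarity threshold are all illustrative assumptions; they only assume pre-extracted frame features and a pooled query embedding in a shared space.

```python
# Illustrative sketch only: fuse adjacent, query-similar frames and map
# fused-token predictions back to original frame indices. Names, the
# mean-pooling of merged frames, and sim_threshold are assumptions, not
# the paper's exact design.
import numpy as np

def fuse_adjacent_frames(frame_feats, query_feat, sim_threshold=0.05):
    """Merge runs of adjacent frames whose query similarities are nearly equal.

    frame_feats: (T, D) array of frame features.
    query_feat:  (D,) pooled language-query feature.
    Returns fused features (T', D) and, for each fused token, the
    (start, end) original-frame span it covers.
    """
    # Cosine similarity between every frame and the query.
    f = frame_feats / (np.linalg.norm(frame_feats, axis=1, keepdims=True) + 1e-8)
    q = query_feat / (np.linalg.norm(query_feat) + 1e-8)
    sims = f @ q  # shape (T,)

    fused, spans = [], []
    start = 0
    for t in range(1, len(sims) + 1):
        # Close the current run when the query similarity changes noticeably
        # or the last frame is reached; average the redundant frames in the run.
        if t == len(sims) or abs(sims[t] - sims[start]) > sim_threshold:
            fused.append(frame_feats[start:t].mean(axis=0))
            spans.append((start, t - 1))
            start = t
    return np.stack(fused), spans

def expand_prediction(pred_start_tok, pred_end_tok, spans):
    """Map start/end predicted on fused tokens back to original frame indices."""
    return spans[pred_start_tok][0], spans[pred_end_tok][1]

# Toy usage with random features (120 frames, 512-d) and a random query.
feats = np.random.randn(120, 512).astype(np.float32)
query = np.random.randn(512).astype(np.float32)
fused, spans = fuse_adjacent_frames(feats, query)
start_frame, end_frame = expand_prediction(2, 5, spans)  # boundaries from a downstream localizer
```

In this reading, any existing localizer runs unchanged on the shorter fused sequence (the plug-and-play aspect), and the recorded spans let the prediction adjustment step recover frame-level boundaries.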