HCFMN: Hierarchical Cross-Modal Fine-Grained Mining Network for Temporal Sentence Grounding

Published: 01 Jan 2025, Last Modified: 05 Nov 2025 · IEEE Trans. Multim. 2025 · CC BY-SA 4.0
Abstract: Temporal Sentence Grounding (TSG) requires a thorough understanding of the complex cross-modal semantic relationships between videos and text. However, existing methods fail to accurately capture content at diverse granularity levels with distinct semantics, making it difficult to achieve fine-grained alignment between visual and textual content. To overcome this issue, we mine rich semantic clues by exploiting a hierarchical correspondence structure together with multi-granularity visual-to-text reconstruction, enabling fine-grained reasoning. Specifically, we propose a novel Hierarchical Cross-modal Fine-grained Mining Network (HCFMN) for the TSG task, which uses an attention mechanism built on temporal hierarchical relationships to extract temporal features corresponding to text at different granularities. We exploit the reconstructability of text from visual features, recovering multi-granularity textual content from coarse to fine by attending to temporal features at different layers; this hierarchically extracts temporal features and their text-related dependencies and realizes fine-grained cross-modal semantic alignment. Furthermore, HCFMN introduces a novel partitioned efficient attention mechanism that significantly improves the model's efficiency through two-stage attention based on sequence and channel compression. Extensive experiments on three public datasets (ActivityNet-Captions, TACoS, and Charades-STA) demonstrate that the proposed method achieves state-of-the-art performance.
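The abstract's "two-stage attention based on sequence and channel compression" can be illustrated with a minimal sketch. The code below is a hypothetical interpretation, not the paper's actual implementation: stage one compresses the key/value sequence by average pooling before attending, and stage two attends in a reduced channel dimension before projecting back. All names, pooling sizes, and projection matrices here are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def two_stage_efficient_attention(x, seq_pool=4, chan_reduce=2, seed=0):
    """Hypothetical sketch of two-stage efficient attention.

    x: (T, d) sequence of temporal features.
    Stage 1 (sequence compression): keys/values are average-pooled
    to T // seq_pool tokens, so attention costs O(T * T/seq_pool).
    Stage 2 (channel compression): features are projected down to
    d // chan_reduce channels, attended, then projected back to d.
    """
    T, d = x.shape
    rng = np.random.default_rng(seed)  # stand-in for learned weights

    # Stage 1: pool keys/values along the sequence axis.
    Tc = T // seq_pool
    kv = x[: Tc * seq_pool].reshape(Tc, seq_pool, d).mean(axis=1)  # (Tc, d)
    attn1 = softmax(x @ kv.T / np.sqrt(d))                          # (T, Tc)
    out1 = attn1 @ kv                                               # (T, d)

    # Stage 2: attend in a channel-reduced space, then project back.
    dc = d // chan_reduce
    W_down = rng.standard_normal((d, dc)) / np.sqrt(d)   # illustrative weights
    W_up = rng.standard_normal((dc, d)) / np.sqrt(dc)
    z = out1 @ W_down                                    # (T, dc)
    attn2 = softmax(z @ z.T / np.sqrt(dc))               # (T, T), cheaper per score
    return (attn2 @ z) @ W_up                            # (T, d)
```

Compressing the key/value sequence first shrinks the dominant quadratic term, and the channel-reduced second stage lowers the per-score cost; this is the general pattern behind many efficient attention variants, sketched here under the stated assumptions.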