Learning Dynamic Similarity By Bidirectional Hierarchical Sliding Semantic Probe For Efficient Text Video Retrieval

Yang Liu

Published: 09 Dec 2024, Last Modified: 12 Dec 2024, AAAI 2025, Everyone, CC BY 4.0

Abstract: Text-video retrieval is a foundational task in multi-modal research that aims to align texts and videos in the embedding space. The key challenge is learning the similarity between videos and texts. A conventional approach directly aligns video-text pairs using cosine similarity. However, videos and texts convey unequal amounts of information (a single video can be described from multiple perspectives), so the retrieval accuracy of this approach is suboptimal. An alternative approach employs cross-modal interaction so that a video can dynamically yield distinct features for different texts before similarity is calculated. Nevertheless, this solution incurs a computational complexity of $O(n^2)$ during retrieval. To address these issues, this paper proposes a novel method called Bidirectional Hierarchical Sliding Semantic Probe (BiHSSP), which calculates dynamic similarity between videos and texts with $O(n)$ complexity during retrieval. We introduce a hierarchical semantic detection module that learns semantic detections at different scales for both video and text features. Semantic detection is a sliding calculation of the cross-correlation between the semantic detections at each scale and the embeddings of the other modality, which allows dynamic similarity computation between a video and text descriptions written from various perspectives. Specifically, for text descriptions from different angles, we calculate the similarity at different positions within the video features. This approach preserves the complete features of the video and addresses the unequal information between video and text without requiring cross-modal interaction. Additionally, our method can serve as a plug-and-play module for various existing methods, enhancing their performance. Experimental results demonstrate that BiHSSP significantly outperforms the baseline, achieving a 2.7% to 4.4% improvement in R@1.
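The sliding cross-correlation described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not say how the semantic detections are learned or how per-position scores are aggregated. The function names, the use of a `(k, d)` array as a detection of scale `k`, the max over positions, and the mean over scales are all assumptions made for illustration.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors along `axis` to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def sliding_scores(detection, sequence):
    """Cross-correlate one detection with a feature sequence.

    detection: (k, d) array, a hypothetical semantic detection of scale k.
    sequence:  (t, d) array of embeddings from the other modality, t >= k.
    Returns a (t - k + 1,) array: mean cosine similarity at each position.
    """
    k, d = detection.shape
    t = sequence.shape[0]
    if t < k:
        raise ValueError("sequence is shorter than the detection window")
    detection = l2_normalize(detection)
    sequence = l2_normalize(sequence)
    # All length-k windows of the sequence: shape (t - k + 1, k, d).
    windows = np.lib.stride_tricks.sliding_window_view(sequence, (k, d))[:, 0]
    # Row-wise dot products, averaged over the k rows of the window.
    return np.einsum("pkd,kd->p", windows, detection) / k


def hierarchical_similarity(detections, sequence):
    """Assumed aggregation: best position per scale, averaged over scales."""
    return float(np.mean([sliding_scores(p, sequence).max() for p in detections]))


# Toy usage: 5 "frame" embeddings and detections at scales 1 and 2.
frames = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
    [1.0, 0.0, 0.0, 0.0],
])
detections = [frames[2:3], frames[2:4]]
print(sliding_scores(detections[1], frames))
print(hierarchical_similarity(detections, frames))
```

In this sketch the frame features are used as-is and each detection only scans them, so different detections can peak at different positions of the same video without any cross-modal attention between the two encoders.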