Similarity Preserving Transformer Cross-Modal Hashing for Video-Text Retrieval

Published: 20 Jul 2024, Last Modified: 21 Jul 2024 · MM 2024 Poster · CC BY 4.0
Abstract: As social networks grow exponentially, there is an increasing demand for video retrieval using natural language. Cross-modal hashing, which encodes multi-modal data into compact hash codes, has been widely used in large-scale image-text retrieval, primarily due to its computational and storage efficiency. When applied to video-text retrieval, however, existing unsupervised cross-modal hashing methods extract frame- or word-level features individually and thus ignore long-term dependencies. In addition, effectively exploiting the multi-modal structure poses a significant challenge due to the intricate nature of video and text. To address these issues, we propose Similarity Preserving Transformer Cross-Modal Hashing (SPTCH), a new unsupervised deep cross-modal hashing method for video-text retrieval. SPTCH encodes video and text with a bidirectional Transformer encoder that exploits their long-term dependencies. It constructs a multi-modal collaborative graph to model correlations among multi-modal data and applies semantic aggregation by employing a Graph Convolutional Network (GCN) on this graph. SPTCH introduces an unsupervised multi-modal contrastive loss and a neighborhood reconstruction loss to effectively exploit the inter- and intra-modal similarity structure among videos and texts. Empirical results on three video benchmark datasets demonstrate that the proposed SPTCH generally outperforms state-of-the-art methods in video-text retrieval.
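The abstract's core ingredients, i.e., binarizing embeddings into compact hash codes, an unsupervised cross-modal contrastive loss over matched video-text pairs, and efficient Hamming-distance retrieval, can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the sign-based binarization, the InfoNCE-style form of the contrastive loss, and all function names are assumptions for illustration.

```python
import numpy as np

def sign_hash(z):
    # Binarize continuous embeddings into +/-1 hash codes via the sign
    # function (a common relaxation in deep hashing; assumed here).
    return np.where(z >= 0, 1.0, -1.0)

def cross_modal_contrastive_loss(video_emb, text_emb, temperature=0.1):
    # InfoNCE-style cross-modal contrastive loss (illustrative form):
    # the matched video-text pair at each batch index is the positive,
    # all other pairs in the batch serve as negatives.
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = v @ t.T / temperature                   # (B, B) similarities
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))              # NLL of the positives

def hamming_distance(h1, h2):
    # For K-bit +/-1 codes, Hamming distance reduces to an inner product:
    # d_H = (K - <h1, h2>) / 2, which is what makes hash-based retrieval fast.
    return (h1.shape[-1] - h1 @ h2.T) / 2
```

As a sanity check, perfectly aligned video and text embeddings yield a lower contrastive loss than unrelated ones, and a code's Hamming distance to its own negation equals the code length.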
Primary Subject Area: [Experience] Multimedia Applications
Relevance To Conference: As social networks grow exponentially, there is an increasing demand for video retrieval using natural language. This work proposes a new cross-modal hashing method, Similarity Preserving Transformer Cross-Modal Hashing (SPTCH), which models video-text multi-modal data well without label supervision. It contributes to unsupervised multi-modal data processing and supports fast video-text retrieval.
Submission Number: 4973