Abstract: Video hashing is a technique for encoding videos into binary vectors, facilitating efficient video storage and high-speed computation. Current approaches to video hashing predominantly rely on sequential frame images to produce semantic binary codes. However, videos carry not only visual but also audio signals. We therefore propose a tri-level Transformer-based audio-visual hashing technique for video retrieval, named AVHash. It first processes audio and visual signals separately using large pre-trained AST and ViT models, and then projects temporal audio segments and keyframes into a shared latent semantic space using a Transformer encoder. Subsequently, a gated attention mechanism fuses the paired audio-visual signals in the video, and another Transformer encoder produces the final video representation. Training of AVHash is guided by a video-based contrastive loss together with a semantic alignment regularization term for audio-visual signals. Experimental results show that AVHash significantly outperforms existing video hashing methods in video retrieval tasks. Furthermore, ablation studies reveal that while video hashing based solely on visual signals achieves commendable mAP scores, incorporating audio signals further boosts retrieval performance.
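The abstract does not give the exact formulation of the gated fusion step, but a common form is a learned sigmoid gate that mixes each paired audio and visual token elementwise. The sketch below illustrates that idea only; the weight names, shapes, and the convex-gate formulation are assumptions, not the paper's actual mechanism:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(audio, visual, W, b):
    """Fuse paired audio/visual tokens with a learned elementwise gate.

    audio, visual: (T, d) token sequences already projected into a
    shared latent space. W: (2d, d), b: (d,) are hypothetical learned
    gate parameters. Returns (T, d) fused tokens.
    """
    concat = np.concatenate([audio, visual], axis=-1)  # (T, 2d)
    g = sigmoid(concat @ W + b)                        # gate values in (0, 1)
    # Convex combination: each fused entry lies between the two inputs.
    return g * audio + (1.0 - g) * visual

# Toy usage with random data standing in for AST/ViT features.
rng = np.random.default_rng(0)
T, d = 4, 8
audio = rng.standard_normal((T, d))
visual = rng.standard_normal((T, d))
W = rng.standard_normal((2 * d, d)) * 0.1
b = np.zeros(d)
fused = gated_fusion(audio, visual, W, b)
```

In this formulation the fused sequence keeps the same shape as its inputs, so it can feed directly into a subsequent Transformer encoder.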
Primary Subject Area: [Engagement] Multimedia Search and Recommendation
Secondary Subject Area: [Content] Multimodal Fusion
Relevance To Conference: This paper sets itself apart from previous studies on video hashing by utilizing both the audio and visual components of videos in a tri-level Transformer architecture, named AVHash, to create binary embeddings of videos. Specifically, AVHash first maps the separate audio and visual signals into a shared latent semantic space before projecting them onto the final video space. It then applies a contrastive loss in the video space, along with a regularization constraint that aligns audio and visual signals in the shared latent semantic space, to guide model training. Extensive experiments on two widely used large video datasets demonstrate that AVHash significantly outperforms existing video hashing techniques in video retrieval tasks. Our findings indicate that, while a high mAP score for video retrieval can be achieved using visual signals alone, effectively incorporating audio signals further improves the system's performance.
Submission Number: 3160