Unsupervised Multimodal Graph Contrastive Semantic Anchor Space Dynamic Knowledge Distillation Network for Cross-Media Hash Retrieval
Abstract: Cross-media hash retrieval is an efficient and effective technique for retrieval over multimedia databases. The success of Multimodal Large Models (MLMs) offers a valuable direction for enhancing the accuracy of multimodal hash retrieval: fine-tuning pretrained multimodal large models achieves decent retrieval accuracy, but their massive parameter counts significantly reduce retrieval efficiency. Knowledge Distillation (KD) methods enable small models to learn from the knowledge of larger models, reducing the parameter count while maintaining a certain level of accuracy. However, current KD methods face challenges in the multimodal domain, as they must preserve multimodal semantic information while minimizing accuracy degradation. To address these challenges, we propose a novel unsupervised multimodal graph contrastive semantic anchor space dynamic knowledge distillation network for cross-media hash retrieval (GASKN). First, to obtain a multimodal semantic anchor space, we construct a large multimodal fusion teacher model with BEiT-3 as the backbone. This teacher model encodes data from different modalities, such as images and text, with the same multimodal encoder, yielding multimodal hash codes that simultaneously contain rich information from both modalities. Second, to ensure efficient retrieval by the student model, we adopt the ALBERT text encoder and the BiFormer image encoder as the backbones of a compact student model, yielding a lightweight student with only one twentieth of the teacher's parameter count. We further propose a dynamic knowledge distillation technique that transfers as much of the multimodal semantic anchor space knowledge embedded in the large teacher model to the lightweight student model as possible.
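The dynamic distillation idea above can be sketched minimally: the student's continuous outputs are pulled toward the teacher's binarized hash codes, with a distillation weight that changes over training. The function names, the sign-based binarization, and the linear annealing schedule below are all illustrative assumptions, not the paper's actual formulation.

```python
import numpy as np

def hash_codes(z):
    # Binarize continuous embeddings to {-1, +1} hash codes via sign
    return np.where(z >= 0, 1.0, -1.0)

def dynamic_kd_loss(student_z, teacher_z, epoch, total_epochs):
    """Illustrative dynamic distillation loss: the student's continuous
    outputs are pulled toward the teacher's hash codes (the semantic
    anchors), weighted by one simple 'dynamic' schedule that grows
    linearly over training. Schedule and loss form are hypothetical."""
    alpha = epoch / total_epochs          # hypothetical annealing weight
    target = hash_codes(teacher_z)        # teacher's binarized anchors
    mse = np.mean((student_z - target) ** 2)
    return alpha * mse
```

In practice a schedule like this would be combined with the student's own retrieval objective; the sketch only isolates the teacher-to-student transfer term.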
Third, to further distill the structural knowledge of the semantic anchor space from the teacher model to the student model, we propose a graph attention contrastive learning mechanism that enables structural semantic space learning and thereby mines implicit fine-grained cross-media semantic information. Evaluations on three widely used datasets demonstrate that GASKN significantly outperforms existing state-of-the-art hashing algorithms.
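A toy rendering of the graph attention contrastive mechanism: node features are aggregated with single-head attention restricted to graph edges, and an InfoNCE-style loss treats matching teacher/student nodes as positives. Every name, the dot-product attention, and the loss form are assumptions for illustration only, not the paper's architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def graph_attention(z, adj):
    # Single-head attention over graph neighbours: dot-product scores,
    # masked by the adjacency matrix, then used to aggregate features.
    scores = z @ z.T
    scores = np.where(adj > 0, scores, -1e9)   # keep edges only
    attn = softmax(scores, axis=1)
    return attn @ z

def contrastive_loss(student_g, teacher_g, tau=0.5):
    # InfoNCE-style loss: matching student/teacher nodes are positives,
    # all other cross pairs serve as negatives.
    s = student_g / np.linalg.norm(student_g, axis=1, keepdims=True)
    t = teacher_g / np.linalg.norm(teacher_g, axis=1, keepdims=True)
    logits = (s @ t.T) / tau
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))
```

The adjacency mask is what makes the contrast "structural": each node's representation already encodes its neighbourhood before the teacher/student alignment is scored.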