CrossHash: Cross-scale Vision Transformer Hashing for Image Retrieval

Published: 2025 (ICASSP 2025), Last Modified: 23 Jan 2026. License: CC BY-SA 4.0
Abstract: Transformers have made significant progress on computer vision tasks. However, most existing Vision Transformers (ViTs) focus on single-scale information, limiting their ability to model interactions among multi-scale features. Moreover, explicitly mapping continuous real-valued features to discrete hash codes through a quantization layer is suboptimal for retrieval. To overcome these problems, we propose CrossHash, a novel deep hashing method based on a cross-scale Transformer that extracts multi-scale features. Furthermore, we introduce a relative similarity quantization method that maximizes the similarity between the relative positional representation of the continuous codes and the normalized centroids, effectively reducing quantization error. The model also optimizes the feature distribution with a contrastive learning loss, maximizing the inter-class distance and minimizing the intra-class distance of the learned features. Extensive experiments on three benchmark datasets demonstrate that the proposed model outperforms other state-of-the-art deep hashing methods. Source code is available at https://github.com/wwg1010/CrossHash.
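The abstract does not give the exact loss formulations, so the following is a minimal sketch of the two objectives it describes, under assumptions: the quantization term is assumed to be a cosine similarity between L2-normalized continuous codes and their sign-based binary centroids, and the contrastive term is assumed to be a standard supervised contrastive loss. Function names and the loss weighting are hypothetical, not the authors' implementation.

```python
# Hypothetical sketch of the two objectives described in the abstract.
# Loss forms and weights are assumptions, not the paper's definitions.
import torch
import torch.nn.functional as F

def relative_similarity_quantization_loss(h):
    """Align continuous codes h (B x K) with their sign-based binary
    centroids after normalization (assumed form of the quantization term)."""
    b = torch.sign(h.detach())             # binary centroid per sample
    h_n = F.normalize(h, dim=1)            # direction-only (relative) representation
    b_n = F.normalize(b, dim=1)            # normalized centroid
    cos = (h_n * b_n).sum(dim=1)           # cosine similarity in [-1, 1]
    return (1.0 - cos).mean()              # maximizing similarity == minimizing 1 - cos

def supervised_contrastive_loss(h, labels, tau=0.2):
    """Pull same-class codes together, push different-class codes apart."""
    z = F.normalize(h, dim=1)
    sim = z @ z.t() / tau                                    # B x B similarity logits
    mask_pos = (labels.unsqueeze(0) == labels.unsqueeze(1)).float()
    mask_pos.fill_diagonal_(0)                               # exclude self-pairs
    logits = sim - torch.eye(len(z), device=z.device) * 1e9  # mask self in denominator
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    pos_count = mask_pos.sum(dim=1).clamp(min=1)
    return -(mask_pos * log_prob).sum(dim=1).div(pos_count).mean()

# Usage: combine both terms with a hypothetical weighting.
h = torch.randn(8, 64, requires_grad=True)   # continuous codes from the backbone
labels = torch.randint(0, 4, (8,))
loss = supervised_contrastive_loss(h, labels) + 0.1 * relative_similarity_quantization_loss(h)
loss.backward()
```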