Swin Transformer-based supervised hashing

Published: 01 Jan 2023, Last Modified: 13 Nov 2024, Appl. Intell. 2023, CC BY-SA 4.0
Abstract: With the rapid development of the modern internet, image data are growing explosively, and retrieving specific images from such big data has become an urgent problem. A common solution is hash-based approximate nearest neighbor retrieval, which represents the original images with compact binary hash codes. Image similarity can then be computed with fast bit operations, and only a small amount of memory is needed to store the hash codes. In recent years, the combination of deep learning and hash learning has led to breakthroughs in hash-based image retrieval. In particular, convolutional neural networks (CNNs) are widely used in various deep hashing methods. However, CNNs cannot capture global image information well when extracting image features, which affects the quality of the hash codes. Therefore, we first introduce the Swin Transformer network into hash learning and propose Swin Transformer-based supervised hashing (SWTH). Using the Swin Transformer as the feature extraction backbone, SWTH captures the global context of an image by establishing relations among its different blocks. Furthermore, the Swin Transformer adopts a hierarchical structure with layer-by-layer downsampling, which yields rich multiscale feature information while extracting global information. A hash layer is added after the feature extraction network for hash learning, and the image feature representation and hash function are learned jointly by optimizing a combination of hash loss, classification loss and quantization loss. Extensive experimental results show that SWTH outperforms many state-of-the-art methods and achieves excellent retrieval performance.
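To make the described pipeline concrete, below is a minimal, illustrative PyTorch sketch of a Swin-backbone deep hashing model in the spirit of SWTH: a Swin Transformer feature extractor, a hash layer producing relaxed binary codes, and a combined objective of a pairwise hash loss, a classification loss, and a quantization loss. The backbone name (taken from the timm library), the hash-layer design, the specific pairwise loss form, and the loss weights are assumptions for illustration, not the authors' released implementation.

```python
# Illustrative sketch only; loss forms and hyperparameters are assumed, not from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F
import timm  # assumed dependency: provides pretrained Swin Transformer backbones


class SwinHashNet(nn.Module):
    def __init__(self, num_bits=64, num_classes=100,
                 backbone="swin_tiny_patch4_window7_224"):
        super().__init__()
        # Swin Transformer backbone; num_classes=0 removes the classifier head
        # so the forward pass returns pooled global features.
        self.backbone = timm.create_model(backbone, pretrained=True, num_classes=0)
        feat_dim = self.backbone.num_features
        # Hash layer: maps features to num_bits real values in (-1, 1),
        # a continuous relaxation of binary codes.
        self.hash_layer = nn.Sequential(nn.Linear(feat_dim, num_bits), nn.Tanh())
        # Classification head on top of the relaxed hash codes.
        self.classifier = nn.Linear(num_bits, num_classes)

    def forward(self, x):
        feats = self.backbone(x)      # (B, feat_dim) global image features
        h = self.hash_layer(feats)    # (B, num_bits) relaxed hash codes
        logits = self.classifier(h)   # (B, num_classes) class scores
        return h, logits


def combined_loss(h, logits, labels, w_hash=1.0, w_cls=1.0, w_quant=0.1):
    """Hash loss + classification loss + quantization loss (weights assumed)."""
    # Pairwise similarity indicator from class labels: 1 if same class, else 0.
    sim = (labels.unsqueeze(0) == labels.unsqueeze(1)).float()
    # Pairwise likelihood-style hash loss on inner products of relaxed codes.
    inner = h @ h.t() / 2.0
    hash_loss = (F.softplus(inner) - sim * inner).mean()
    # Standard cross-entropy classification loss.
    cls_loss = F.cross_entropy(logits, labels)
    # Quantization loss pushes relaxed codes toward binary values {-1, +1}.
    quant_loss = (h - h.sign()).pow(2).mean()
    return w_hash * hash_loss + w_cls * cls_loss + w_quant * quant_loss
```

At retrieval time, the binary code of an image would be obtained as the sign of the hash-layer output, and database images can be ranked by Hamming distance to the query code, computable with XOR and popcount bit operations.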