Abstract: Integrating several types of features into a highly discriminating and robust representation for image retrieval remains challenging. Most image retrieval methods rely on deep or semantic features extracted from deep neural networks, increasingly neglecting handcrafted features. Importantly, combining different types of features does not necessarily yield higher accuracy than using a single feature type. Therefore, this study proposes a novel method that combines handcrafted, deep, and semantic features into a compact representation: deep cross-semantic features (DCSF). Its major contributions are as follows. (1) It presents a global coarse target filter that integrates an image's global features (e.g., color and texture), effectively smoothing the coarse targets in the deep feature maps and reducing their interference in the representation. (2) It exploits the discriminative information of convolutional neural networks to introduce a local fine target detector, which highlights the fine targets in the deep feature maps and increases the discriminative power of the deep features. (3) It fuses the deep features of a convolutional neural network with the semantically discriminative features of the Swin Transformer to create a highly discriminating representation. The method thus produces a representation grounded in global, local, and semantic information, and offers a preliminary remedy for the performance degradation that can arise when combining different feature types. The presented approach yields highly competitive retrieval performance on the Oxford5K, Oxford105K, Paris6K, Paris106K, and Holidays datasets in terms of mean average precision.
External IDs: doi:10.1016/j.eswa.2024.126157
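To make the fusion step concrete, here is a minimal PyTorch sketch of the general idea of concatenating CNN deep features with Swin Transformer semantic features into one retrieval descriptor. It is an illustration under stated assumptions, not the paper's DCSF pipeline: the global coarse target filter and local fine target detector are omitted, the torchvision ResNet-50 and Swin-T backbones and the `extract_descriptor` helper are assumptions, and plain concatenation stands in for whatever fusion the authors actually use.

```python
# Hypothetical sketch: fuse pooled CNN deep features with Swin Transformer
# semantic features into a single L2-normalized retrieval descriptor.
# Backbones, helper name, and concatenation fusion are illustrative assumptions.
import torch
import torch.nn.functional as F
from torchvision.models import resnet50, ResNet50_Weights, swin_t, Swin_T_Weights

# Deep branch: ResNet-50 with the classifier removed -> 2048-d pooled feature.
cnn = resnet50(weights=ResNet50_Weights.DEFAULT)
cnn.fc = torch.nn.Identity()
cnn.eval()

# Semantic branch: Swin-T with the classifier removed -> 768-d pooled feature.
swin = swin_t(weights=Swin_T_Weights.DEFAULT)
swin.head = torch.nn.Identity()
swin.eval()

@torch.no_grad()
def extract_descriptor(batch: torch.Tensor) -> torch.Tensor:
    """batch: (B, 3, 224, 224) ImageNet-normalized images."""
    deep = F.normalize(cnn(batch), dim=1)       # (B, 2048)
    semantic = F.normalize(swin(batch), dim=1)  # (B, 768)
    fused = torch.cat([deep, semantic], dim=1)  # (B, 2816)
    return F.normalize(fused, dim=1)            # unit length: dot product = cosine

# Retrieval: rank database images by cosine similarity to the query.
# Random tensors stand in for preprocessed images here.
query = extract_descriptor(torch.randn(1, 3, 224, 224))
database = extract_descriptor(torch.randn(8, 3, 224, 224))
scores = database @ query.T                     # (8, 1) cosine similarities
ranking = scores.squeeze(1).argsort(descending=True)
print(ranking)
```

Normalizing each branch before concatenation keeps one backbone's feature magnitudes from dominating the fused descriptor, which is one plausible reason naive feature combination can underperform a single feature type, the failure mode the abstract highlights.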