Abstract: Highlights
• This paper proposes a novel Multi-Task Hierarchical Convolutional Network (MT-HCN) for visual-semantic cross-modal retrieval, characterized by adopting a classification task to improve hierarchical multi-modal representation learning.
• This paper proposes a novel Self-Supervision Clustering (SSC) loss to learn exterior representations that fully exploit low-level fine-grained correlations for associating images and texts.
• This paper presents an effective bidirectional ranking loss, namely Harmonious Bidirectional Ranking (HBR), for preserving cross-modal correlation. It not only efficiently identifies more representative hard negative samples, but also leverages the category center of negatives to enhance the robustness of cross-modal representations.
• Extensive experiments on two benchmark datasets validate the superiority of the proposed model over state-of-the-art approaches.
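To make the HBR idea concrete, the following is a minimal sketch of a bidirectional hinge-based ranking loss with hard-negative mining, plus a term that penalizes the mean similarity of negatives as a stand-in for the "category center of negatives". The exact HBR formulation is not given in the highlights, so the margin value, the `center_weight` blending factor, and the mean-of-negatives proxy are all assumptions for illustration.

```python
import numpy as np

def l2_normalize(x):
    # Row-wise L2 normalization so dot products become cosine similarities.
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def bidirectional_ranking_loss(img_emb, txt_emb, margin=0.2, center_weight=0.5):
    """Illustrative bidirectional ranking loss (not the paper's exact HBR).

    img_emb, txt_emb: (N, D) arrays; row i of each forms a matched pair.
    """
    img = l2_normalize(img_emb)
    txt = l2_normalize(txt_emb)
    sim = img @ txt.T                    # (N, N) cosine-similarity matrix
    pos = np.diag(sim)                   # similarities of matched pairs
    n = sim.shape[0]

    # Mask the diagonal before searching for hard negatives.
    neg_mask = ~np.eye(n, dtype=bool)
    neg_i2t = np.where(neg_mask, sim, -np.inf).max(axis=1)  # hardest caption per image
    neg_t2i = np.where(neg_mask, sim, -np.inf).max(axis=0)  # hardest image per caption

    hard = (np.maximum(0.0, margin + neg_i2t - pos).mean()
            + np.maximum(0.0, margin + neg_t2i - pos).mean())

    # Assumed proxy for the "category center of negatives": the mean
    # negative similarity, also pushed below the positive by the margin.
    mean_i2t = np.where(neg_mask, sim, 0.0).sum(axis=1) / (n - 1)
    mean_t2i = np.where(neg_mask, sim, 0.0).sum(axis=0) / (n - 1)
    center = (np.maximum(0.0, margin + mean_i2t - pos).mean()
              + np.maximum(0.0, margin + mean_t2i - pos).mean())

    return hard + center_weight * center
```

With perfectly aligned pairs and orthogonal negatives (e.g. identity embeddings) the loss is zero, since every positive similarity exceeds every negative by more than the margin; mismatched embeddings yield a positive penalty from both directions.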