Deep Convolutional Neural Network for Correlating Images and Sentences

Yuhua Jia, Liang Bai, Peng Wang, Jinlin Guo, Yuxiang Xie

Published: 2018, Last Modified: 23 May 2023, MMM (1) 2018, Readers: Everyone

Abstract: In this paper, we address the problem of image-sentence matching and propose a novel convolutional neural network architecture comprising three modules: a visual module that composes fragment-level features of images, a textual module that composes fragment-level features of sentences, and a fusional module that jointly encodes the features of image and sentence fragments to generate the final matching score of an image-sentence pair. Unlike previous fragment-level models, the proposed method represents image fragments as feature maps generated by a CNN, which is more reasonable and effective. Because each modality (image or text) can use its own independent, specialized fragment-level representation, the method can flexibly interlink the intermediate fragment features into a joint abstraction of the two modalities, which yields better matching scores. Extensive evaluations on two benchmark datasets validate the competitive performance of our approach against state-of-the-art bidirectional image-sentence retrieval approaches.
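
The three-module structure described in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the abstract gives no layer sizes, kernels, or training details, so everything concrete here is an assumption. In particular, each spatial location of a CNN feature map is treated as one image fragment, a fixed window average stands in for a learned 1-D convolution over word vectors, and the fusion step is a single ReLU layer over concatenated fragment pairs followed by max-pooling. All function names (`image_fragments`, `sentence_fragments`, `fuse_and_score`) are hypothetical.

```python
import random


def image_fragments(feature_map):
    """Visual module (assumed form): flatten an H x W x C feature map
    into H*W fragment vectors of length C, one per spatial location."""
    return [cell for row in feature_map for cell in row]


def sentence_fragments(word_vectors, window=2):
    """Textual module (assumed form): average each window of consecutive
    word vectors, a fixed stand-in for a learned 1-D convolution."""
    dim = len(word_vectors[0])
    frags = []
    for start in range(len(word_vectors) - window + 1):
        chunk = word_vectors[start:start + window]
        frags.append([sum(vec[d] for vec in chunk) / window for d in range(dim)])
    return frags


def fuse_and_score(img_frags, sent_frags, weights, bias, out_weights):
    """Fusional module (assumed form): encode every (image, sentence)
    fragment pair jointly with one ReLU layer, max-pool over all pairs,
    then map the pooled vector to a scalar matching score."""
    hidden = len(weights)
    pooled = [0.0] * hidden  # ReLU outputs are >= 0, so 0 is a safe floor
    for v in img_frags:
        for t in sent_frags:
            joint_in = v + t  # concatenate the two fragment vectors
            for h in range(hidden):
                act = sum(w * x for w, x in zip(weights[h], joint_in)) + bias[h]
                pooled[h] = max(pooled[h], max(0.0, act))
    return sum(w * p for w, p in zip(out_weights, pooled))


def random_matrix(rows, cols, rng):
    """Helper: a rows x cols matrix of values drawn uniformly from [-1, 1]."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def demo_score(seed=0):
    """Score one toy image-sentence pair with random, untrained weights."""
    rng = random.Random(seed)
    # Toy 2 x 3 feature map with 4 channels, and 5 word vectors of size 4.
    feature_map = [[[rng.random() for _ in range(4)] for _ in range(3)]
                   for _ in range(2)]
    words = random_matrix(5, 4, rng)
    img = image_fragments(feature_map)
    sent = sentence_fragments(words, window=2)
    weights = random_matrix(6, 8, rng)  # 6 hidden units over 4 + 4 inputs
    bias = [0.0] * 6
    out_weights = [rng.uniform(-1.0, 1.0) for _ in range(6)]
    score = fuse_and_score(img, sent, weights, bias, out_weights)
    return len(img), len(sent), score
```

Calling `demo_score()` returns the number of image fragments, the number of sentence fragments, and an untrained matching score for the toy pair. In a retrieval setting, such a score would be computed for every candidate image-sentence pair and used to rank candidates in both directions.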
