Abstract: Attention-based image captioning frameworks have been widely explored in recent years. However, most techniques generate the next word conditioned on the previous words and the current visual content, without considering the relationship between the semantic and visual content. In this paper, we present a novel framework that models relevance and coherence at the same time. Relevance captures the relationship between the semantic and visual content in a semantic-visual embedding space, while coherence maximizes the probability of generating the next word given the previous words and the current visual content. We evaluate our model on three benchmark datasets: Flickr8k, Flickr30k and MS COCO. The experimental results show that the proposed approach improves the performance of attention-based image captioning methods.
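
To make the two terms concrete, the following is a minimal sketch of how a coherence term and a relevance term might be combined into one training objective. The abstract does not give the exact formulation, so everything here is an assumption for illustration: the function names, the use of negative log-likelihood for coherence, the use of one minus cosine similarity for relevance, and the `weight` parameter are hypothetical and are not taken from the paper.

```python
import math


def coherence_loss(step_probs):
    """Negative log-likelihood of the reference caption.

    step_probs[t] is the probability the model assigned to the correct word
    at step t, given the previous words and the current visual content.
    Minimizing this value maximizes the probability of the next word.
    """
    return -sum(math.log(p) for p in step_probs)


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def relevance_loss(visual_vec, semantic_vec):
    """Assumed relevance term: 1 - cosine similarity in a shared
    semantic-visual embedding space (0 when the vectors are aligned)."""
    return 1.0 - cosine(visual_vec, semantic_vec)


def joint_loss(step_probs, visual_vec, semantic_vec, weight=1.0):
    """Hypothetical joint objective: coherence plus weighted relevance."""
    return coherence_loss(step_probs) + weight * relevance_loss(
        visual_vec, semantic_vec
    )
```

In this sketch, a caption whose words all receive high probability and whose semantic embedding points in the same direction as the visual embedding yields a loss close to zero; the paper's actual losses and weighting may differ.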