Keywords: sequential text-to-image retrieval, story-to-image retrieval, scene graph embedding, dual learning
Abstract: Sequential text-to-image retrieval, a.k.a. the story-to-image task, requires simultaneously aligning retrieved images semantically with a given story and maintaining global coherence across the retrieved image sequence. Most previous works have focused only on faithfully following the content of the given story. This overfitting tendency hinders matching structural similarity between images, causing inconsistency in global visual information such as backgrounds. To address this imbalance, we propose a novel image sequence retrieval framework that exploits scene graph similarities between images together with a dual learning scheme. A scene graph captures high-level visual groundings and the adjacency relations among the key entities in a visual scene. In our proposed retriever, a graph encoding head learns to maximize graph embedding similarities among sampled images, providing a strong signal that forces the retriever to also consider morphological relevance to previously sampled images. We cast video captioning as a dual learning task that reconstructs the input story from the sampled image sequence; this inverse mapping gives informative feedback that helps our retrieval system preserve the global context of the given story. We also propose a contextual sentence encoding architecture that embeds each sentence in consideration of its surrounding context. Through extensive experiments on the Visual Storytelling benchmark, our proposed framework shows better qualitative and quantitative performance than conventional story-to-image models.
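To make the two training signals concrete, here is a minimal PyTorch-style sketch of how the graph-similarity term and the dual captioning term could be combined with a base retrieval loss. This is our own illustration, not code from the paper; the function names and the loss weights lambda_g and lambda_d are hypothetical.

    import torch
    import torch.nn.functional as F

    def coherence_loss(graph_embs: torch.Tensor) -> torch.Tensor:
        # graph_embs: (T, d) scene-graph embeddings, one per sampled image.
        # Maximizing cosine similarity between adjacent embeddings pushes the
        # retriever toward structurally consistent consecutive images.
        prev, nxt = graph_embs[:-1], graph_embs[1:]
        return (1.0 - F.cosine_similarity(prev, nxt, dim=-1)).mean()

    def total_loss(retrieval_loss: torch.Tensor,
                   graph_embs: torch.Tensor,
                   caption_logits: torch.Tensor,
                   story_tokens: torch.Tensor,
                   lambda_g: float = 0.1,
                   lambda_d: float = 0.1) -> torch.Tensor:
        # Dual task: a video captioner reconstructs the input story from the
        # retrieved image sequence; its cross-entropy acts as feedback that
        # the sampled images still carry the story's global context.
        captioning_loss = F.cross_entropy(
            caption_logits.view(-1, caption_logits.size(-1)),
            story_tokens.view(-1))
        return (retrieval_loss
                + lambda_g * coherence_loss(graph_embs)
                + lambda_d * captioning_loss)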
One-sentence Summary: Using scene graph embedding and dual learning to improve story-to-image retrieval.