Towards Visual Storytelling by Understanding Narrative Context Through Scene-Graphs

Itthisak Phueaksri, Marc A. Kastner, Yasutomo Kawanishi, Takahiro Komamizu, Ichiro Ide

Published: 2025, Last Modified: 06 Mar 2025, MMM (4) 2025, CC BY-SA 4.0

Abstract: VIsual STorytelling (VIST) is the task of transforming a sequence of images into a narrative text story. A narrative story requires an understanding of the contexts of, and relationships among, the images. Our study introduces a story generation process that constructs both image and narrative contexts to control the coherence of the generated narrative. First, the image contexts are generated from the content of individual images, using image features and scene graphs that detail the elements of each image. Second, the narrative context is generated by focusing on the overall image sequence, ensuring that each caption fits within the overall story and maintains continuity and coherence. We also introduce a narrative concept summary, which is external knowledge represented as a knowledge graph. This summary encapsulates the narrative concept of an image sequence to enhance the understanding of its overall content. Both image and narrative contexts are then used to generate a coherent and engaging narrative. The framework is based on Long Short-Term Memory (LSTM) with an attention mechanism. We evaluate the proposed method on the VIST dataset, and the results highlight the importance of understanding the context of an image sequence, and of incorporating narrative context into the generation process, for producing coherent and engaging stories.
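
The abstract does not specify how the attention mechanism combines the two kinds of context. As a rough illustration only, the sketch below assumes plain dot-product attention in which a narrative-context vector acts as the query over one context vector per image. The function names, vector sizes, and toy values are all hypothetical and are not taken from the paper.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, contexts):
    """Dot-product attention (an assumed formulation, not the paper's).

    Scores each context vector by its dot product with the query,
    normalizes the scores with softmax, and returns the weights
    together with the weighted sum of the context vectors.
    """
    scores = [sum(q * c for q, c in zip(query, ctx)) for ctx in contexts]
    weights = softmax(scores)
    dim = len(contexts[0])
    fused = [sum(w * ctx[i] for w, ctx in zip(weights, contexts))
             for i in range(dim)]
    return weights, fused

# Hypothetical toy inputs: one context vector per image in the sequence,
# plus a single narrative-context vector used as the attention query.
image_contexts = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
narrative_context = [1.0, 1.0]

weights, fused = attend(narrative_context, image_contexts)
```

In this toy setup the third image context aligns best with the narrative query and therefore receives the largest weight. In an LSTM decoder, a fused vector of this kind would typically be fed to the recurrent cell at each generation step.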