Keywords: visual storytelling, computer vision, natural language generation, large language models
TL;DR: We present a visual storytelling framework inspired by narrative theories, and evaluate our generated stories for visual novelty and reader willingness to read more.
Abstract: We propose a visual storytelling framework that distinguishes between what is present and observable in the visual storyworld and what story is ultimately told. We implement a model that tells a story from an image using three affordances: 1) a fixed set of visual properties in an image that constitute a holistic representation of its contents, 2) a variable stage direction that establishes the story setting, and 3) incremental questions about character goals. The generated narrative plans are then realized as expressive texts using few-shot learning. Following this approach, we generated 64 visual stories and measured the preservation, loss, and gain of visual information throughout the pipeline, as well as a reader's willingness to take action to read more. We report different proportions of visual information preserved and lost depending on the phase of the pipeline and the stage direction's apparent relatedness to the image, and report that 83% of stories were found to be interesting.
Submission Type: archival
Presentation Type: onsite
Presenter: Stephanie Lukin and Sungmin Eum