MAGIC: Multimodal Story Generation from Image Collections

ACL ARR 2026 January Submission4648 Authors

05 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: data-to-text generation, multimodal story generation
Abstract: We introduce a new task, Multimodal Story Generation from Image Collections (MAGIC), in which the goal is to generate a coherent narrative conditioned on a small subset of images retrieved from a large, unordered collection. This underexplored task reflects real-world constraints in which stories must be created from limited visual assets. To address the dual challenges of strategically selecting narratively useful images and constructing a coherent, visually grounded story, we propose a narrative-centric framework that first selects a diverse yet compatible subset of images, next infers their temporal and causal ordering, then bridges narrative gaps, and finally expands the result into a full story. We also build a dataset of 4,000 images to support this new task. Extensive automated and human evaluations show that our approach significantly outperforms baseline methods in narrative coherence, logical consistency, novelty, naturalness, and visual engagement, establishing a strong foundation for multimodal storytelling under realistic resource constraints.
Paper Type: Long
Research Area: Natural Language Generation
Research Area Keywords: data-to-text generation
Contribution Types: NLP engineering experiment, Data resources
Languages Studied: English
Submission Number: 4648