Story2Screen: Multimodal Story Customization for Long Consistent Visual Sequences

10 Sept 2025 (modified: 14 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Image Generation
Abstract: Multimodal Story Customization aims to generate coherent story flows conditioned on both textual descriptions and reference identity images. While recent progress in story generation has shown promising results, most existing approaches rely on text-only inputs; a few works incorporate character identity cues (e.g., facial ID) but lack broader multimodal conditioning. This limited conditioning makes it difficult to jointly preserve the consistency of characters, scenes, and textual details across frames, constraining the applicability of these approaches in practical domains such as filmmaking, advertising, and storytelling. In this work, we introduce Story2Screen, a multimodal framework that integrates free-form descriptions with character and background references to enable coherent and customizable story generation. To enhance cinematic diversity, we further introduce shot-type control via parameter-efficient prompt tuning on movie data, enabling the model to generate sequences that more faithfully reflect real-world cinematic grammar. To comprehensively evaluate our framework, we establish two new benchmarks, MSB and M$^2$SB, which assess multimodal story customization in terms of character/scene consistency, text–visual alignment, and shot-type control. Extensive experiments demonstrate that Story2Screen achieves improved consistency and cinematic diversity compared to existing methods.
Supplementary Material: zip
Primary Area: generative models
Submission Number: 3570
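
As a rough illustration of the shot-type control mechanism mentioned in the abstract, below is a minimal PyTorch sketch of parameter-efficient prompt tuning: a small bank of learnable soft-prompt tokens, one set per shot type, is prepended to the output of a frozen text encoder, so only the prompt bank is trained. The module name, shot-type labels, and dimensions are hypothetical assumptions for illustration, not taken from the paper.

```python
import torch
import torch.nn as nn

# Assumed shot-type vocabulary; the paper's actual label set may differ.
SHOT_TYPES = ["close-up", "medium", "long", "over-the-shoulder"]

class ShotTypePromptTuner(nn.Module):
    """Hypothetical prompt-tuning module: learnable soft prompts per shot type."""

    def __init__(self, embed_dim: int = 768, prompt_len: int = 8,
                 num_shot_types: int = len(SHOT_TYPES)):
        super().__init__()
        # One learnable prompt of `prompt_len` token embeddings per shot type.
        self.prompts = nn.Parameter(
            torch.randn(num_shot_types, prompt_len, embed_dim) * 0.02
        )

    def forward(self, text_embeds: torch.Tensor, shot_ids: torch.Tensor):
        # text_embeds: (B, T, D) from a frozen text encoder
        # shot_ids:    (B,) integer indices into SHOT_TYPES
        prompt = self.prompts[shot_ids]               # (B, prompt_len, D)
        return torch.cat([prompt, text_embeds], 1)    # (B, prompt_len + T, D)

# Usage sketch: the backbone stays frozen; only the prompt bank is optimized.
tuner = ShotTypePromptTuner()
text_embeds = torch.randn(2, 77, 768)                 # placeholder encoder output
shot_ids = torch.tensor([0, 2])                       # e.g. close-up, long shot
conditioned = tuner(text_embeds, shot_ids)            # (2, 85, 768)
optimizer = torch.optim.AdamW(tuner.parameters(), lr=1e-4)
```

This keeps the trainable parameter count tiny (num_shot_types × prompt_len × embed_dim), which is the usual appeal of prompt tuning over full fine-tuning when adapting a generative backbone to a new conditioning signal.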