DreamingComics: A Story Visualization Pipeline via Subject and Layout Customized Generation using Video Models
Abstract: Existing story visualization systems often rely on text-only control, which makes it difficult to decide where multiple characters should appear and to keep their visual appearance consistent across panels. DreamingComics addresses this by introducing a layout-aware framework that jointly reasons about subject identity, spatial layout, and artistic style for comic-style stories.
We build on a pretrained video diffusion transformer and repurpose it for image customization, leveraging its spatiotemporal priors to improve identity and style consistency across generated panels. To control spatial layout, we introduce RegionalRoPE, a region-aware rotary position embedding that re-indexes reference tokens according to target bounding boxes, and a masked condition loss that penalizes attention that leaks outside the designated regions. Complementing this, an LLM-based layout generator is fine-tuned on comic layout data to predict panel and character boxes directly from textual scripts, reducing the need for manual layout design.
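The two layout-control components above can be illustrated with a minimal sketch. Note that `regional_reindex` and `masked_condition_loss` are hypothetical names introduced here for illustration; the paper's actual indexing scheme and loss formulation may differ. The idea is that reference-image tokens are assigned 2D rotary-embedding positions stretched into the target bounding box, and attention mass landing outside the box is penalized:

```python
import numpy as np

def regional_reindex(ref_h, ref_w, box):
    """Map a ref_h x ref_w reference-token grid into a target bounding box.

    box is (top, left, bottom, right) in target token coordinates.
    Returns (ref_h * ref_w, 2) array of (row, col) positions to use when
    computing the 2D rotary position embedding of the reference tokens,
    so the reference content is aligned with its designated region.
    """
    top, left, bottom, right = box
    rows = np.linspace(top, bottom, ref_h)   # stretch reference rows into the box
    cols = np.linspace(left, right, ref_w)   # stretch reference cols into the box
    rr, cc = np.meshgrid(rows, cols, indexing="ij")
    return np.stack([rr.ravel(), cc.ravel()], axis=-1)

def masked_condition_loss(attn, region_mask):
    """Penalize attention that leaks outside the designated region.

    attn:        (num_ref_tokens, num_target_tokens) attention weights
    region_mask: (num_target_tokens,) with 1 inside the box, 0 outside
    """
    leak = attn * (1.0 - region_mask)        # attention falling outside the box
    return leak.sum(axis=-1).mean()
```

Under this sketch, a character whose reference tokens are re-indexed into its box attends (and is attended to) as if it already occupied that region, while the loss discourages its influence from bleeding into other panels or characters.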
On benchmarks such as ViStoryBench and DreamBench++, DreamingComics improves character consistency and style similarity over prior story visualization and image customization methods, while maintaining high spatial accuracy with respect to the specified layouts.