Abstract: Text-to-image (T2I) models are rapidly advancing into creative practice and increasingly support the generation of illustrated storybooks, i.e., sequential, image-based narratives conditioned on written text. Previous surveys have examined challenges in video coherence or single-image fidelity, but to the best of our knowledge, no comprehensive review addresses the unique requirements of storybook illustration. This survey fills that gap by grounding the study of AI-illustrated storybook generation in a narratology framework. Specifically, it introduces a six-dimensional consistency model encompassing time, space, character, event and plot, style, and theme. For each dimension, we consolidate definitions, representative methods, datasets, and evaluation metrics, thereby mapping the current landscape of the field. Building on this analysis, we identify cross-dimensional failure modes and limitations of current approaches. Finally, we propose future research directions, including book-scale integrated evaluation systems tailored to illustrated storybooks, more robust and controllable generation pipelines, enhanced multimodal semantic–visual alignment mechanisms, and reader-oriented safety and educational guidelines.
External IDs: dblp:journals/air/LinWLZL26