Keywords: manga layout generation, multimodality, hierarchical structure prediction, layout modeling, visual narrative understanding
Abstract: Panel layout plays a crucial role in how manga pages convey narrative structure, influencing pacing, emphasis, and reading flow. Despite its importance, page-level layout has rarely been the primary modeling target in computational research. This paper investigates whether layout structure can be inferred from panel content representations and how such structure supports geometric layout generation. We compare visual features extracted from panel images with textual descriptions generated by a large multimodal model, using a unified framework to predict hierarchical layout. Our analysis reveals a consistent modality gap: visual representations enable more reliable layout inference, while textual descriptions provide only weak structural cues. Based on these findings, we propose a two-stage framework: first predicting a layout tree, then generating panel bounding boxes conditioned on the predicted structure. This structure-conditioned generation improves geometric accuracy and degrades gracefully when using predicted rather than ground-truth trees. We also introduce Manga109Caption, a new dataset extending Manga109 with panel-level captions for 109 titles. Our programs and datasets are available at [anonymous link].
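To make the two-stage design concrete, below is a minimal PyTorch sketch, not the authors' implementation: all class names, dimensions, and the simplification of the hierarchical layout tree into per-panel node labels are illustrative assumptions.

```python
# Hypothetical sketch of the two-stage pipeline described in the abstract.
# Stage 1 predicts layout structure from panel features; Stage 2 generates
# bounding boxes conditioned on that predicted structure.
import torch
import torch.nn as nn

class TreePredictor(nn.Module):
    """Stage 1: predict layout-tree node labels from per-panel features.
    (A real layout tree is hierarchical; per-panel labels are a simplification.)"""
    def __init__(self, feat_dim: int, num_node_types: int):
        super().__init__()
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=feat_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.node_head = nn.Linear(feat_dim, num_node_types)

    def forward(self, panel_feats):          # (B, N, feat_dim)
        h = self.encoder(panel_feats)
        return self.node_head(h)             # per-panel node-type logits

class BoxGenerator(nn.Module):
    """Stage 2: generate panel bounding boxes conditioned on tree-node labels."""
    def __init__(self, feat_dim: int, num_node_types: int):
        super().__init__()
        self.node_embed = nn.Embedding(num_node_types, feat_dim)
        self.box_head = nn.Sequential(
            nn.Linear(2 * feat_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, 4),           # (x, y, w, h), normalized to [0, 1]
        )

    def forward(self, panel_feats, node_ids):
        cond = self.node_embed(node_ids)
        return torch.sigmoid(self.box_head(torch.cat([panel_feats, cond], dim=-1)))

# Chaining the stages: boxes are conditioned on the *predicted* tree,
# mirroring the graceful-degradation setting mentioned in the abstract.
feats = torch.randn(1, 6, 256)               # 6 panels, 256-d visual features
stage1 = TreePredictor(256, num_node_types=5)
stage2 = BoxGenerator(256, num_node_types=5)
node_ids = stage1(feats).argmax(dim=-1)      # predicted structure labels
boxes = stage2(feats, node_ids)              # (1, 6, 4) panel bounding boxes
```

The key design choice this sketch illustrates is that the box generator never sees the page geometry directly; it receives only panel content features plus the (possibly noisy) predicted structure, which is what allows the comparison between ground-truth and predicted trees.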
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: multimodality, cross-modal application
Contribution Types: Data resources, Data analysis
Languages Studied: English
Submission Number: 7002