Keywords: manga layout generation, multimodality, hierarchical structure prediction, layout modeling, visual narrative understanding
Abstract: Panel layout plays a crucial role in how manga pages convey narrative structure, influencing pacing, emphasis, and reading flow. Despite its importance, page-level layout has rarely been the primary modeling target in computational research. This paper investigates whether layout structure can be inferred from panel content representations and how such structure supports geometric layout generation. We compare visual features extracted from panel images with textual descriptions generated by a large multimodal model, using a unified framework to predict hierarchical layout. Our analysis reveals a consistent modality gap: visual representations enable more reliable layout inference, while textual descriptions provide only weak structural cues. Based on these findings, we propose a two-stage framework: first predicting a layout tree, then generating panel bounding boxes conditioned on the predicted structure. This structure-conditioned generation improves geometric accuracy and degrades gracefully when using predicted rather than ground-truth trees. We also introduce Manga109Caption, a new dataset extending Manga109 with panel-level captions for 109 titles. Our programs and datasets are available at [anonymous link].
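To make the two-stage design concrete, below is a minimal PyTorch sketch, not the authors' implementation: all class names, dimensions, and the simplification of the hierarchical layout tree into per-panel node labels are illustrative assumptions.

```python
# Hypothetical sketch of the two-stage pipeline described in the abstract.
# Stage 1 predicts layout structure from panel features; Stage 2 generates
# bounding boxes conditioned on that predicted structure.
import torch
import torch.nn as nn

class TreePredictor(nn.Module):
    """Stage 1: predict layout-tree node labels from per-panel features.
    (A real layout tree is hierarchical; per-panel labels are a simplification.)"""
    def __init__(self, feat_dim: int, num_node_types: int):
        super().__init__()
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=feat_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.node_head = nn.Linear(feat_dim, num_node_types)

    def forward(self, panel_feats):          # (B, N, feat_dim)
        h = self.encoder(panel_feats)
        return self.node_head(h)             # per-panel node-type logits

class BoxGenerator(nn.Module):
    """Stage 2: generate panel bounding boxes conditioned on tree-node labels."""
    def __init__(self, feat_dim: int, num_node_types: int):
        super().__init__()
        self.node_embed = nn.Embedding(num_node_types, feat_dim)
        self.box_head = nn.Sequential(
            nn.Linear(2 * feat_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, 4),           # (x, y, w, h), normalized to [0, 1]
        )

    def forward(self, panel_feats, node_ids):
        cond = self.node_embed(node_ids)
        return torch.sigmoid(self.box_head(torch.cat([panel_feats, cond], dim=-1)))

# Chaining the stages: boxes are conditioned on the *predicted* tree,
# mirroring the graceful-degradation setting mentioned in the abstract.
feats = torch.randn(1, 6, 256)               # 6 panels, 256-d visual features
stage1 = TreePredictor(256, num_node_types=5)
stage2 = BoxGenerator(256, num_node_types=5)
node_ids = stage1(feats).argmax(dim=-1)      # predicted structure labels
boxes = stage2(feats, node_ids)              # (1, 6, 4) panel bounding boxes
```

The key design choice this sketch illustrates is that the box generator never sees the page geometry directly; it receives only panel content features plus the (possibly noisy) predicted structure, which is what allows the comparison between ground-truth and predicted trees.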
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: multimodality, cross-modal application
Contribution Types: Data resources, Data analysis
Languages Studied: English
Submission Number: 7002