IUT-Plug: A Plug-in tool for Interleaved Image-Text Generation

ICLR 2026 Conference Submission 19636 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Vision Language Models; Computer Vision; Machine Learning
TL;DR: A plug-in module that enhances interleaved image-text generation in VLMs to address compositional fragility and contextual drift.
Abstract: Existing vision language models (VLMs), including GPT-4 and DALL·E, often struggle to preserve logic, object identity, and style in multimodal image-text generation. This limitation significantly hinders the generalization capability of VLMs in complex image-text input-output scenarios. To address this issue, we propose $\textbf{IUT-Plug}$, a plug-in module grounded in an $\textit{Image Understanding Tree}$ (IUT), which enhances existing interleaved VLMs through explicit structured reasoning, thereby mitigating context drift in logic, entity identity, and style. The proposed framework operates in two stages. (1) A dynamic IUT-Plug extraction module parses visual scenes into hierarchical symbolic structures. (2) A coordinated narrative-flow and image synthesis mechanism ensures cross-modal consistency. To evaluate our approach, we construct a novel benchmark of 3,000 human-generated question-answer pairs collected over fine-tuned large models, and introduce a dynamic evaluation protocol for quantifying context drift in interleaved VLMs. Experimental results demonstrate that IUT-Plug not only improves accuracy on established benchmarks but also effectively alleviates the three critical forms of context drift across diverse multimodal question answering (QA) scenarios.
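To make the idea of a hierarchical symbolic scene structure concrete, the sketch below shows one possible shape such an Image Understanding Tree could take: nodes carrying object, attribute, and relation labels, flattened into a textual prompt the generator can condition on across turns. The abstract does not specify the actual IUT schema, so the class name, fields, and serialization here are purely illustrative assumptions, not the authors' implementation.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical sketch only: the paper does not define the IUT schema,
# so every name and field below is an illustrative assumption.
@dataclass
class IUTNode:
    label: str                       # e.g. "scene", "object", "attribute"
    value: str = ""                  # e.g. "red ceramic mug"
    children: List["IUTNode"] = field(default_factory=list)

    def add(self, child: "IUTNode") -> "IUTNode":
        self.children.append(child)
        return child

    def flatten(self, depth: int = 0) -> str:
        """Serialize the tree into indented text a VLM could condition on."""
        lines = ["  " * depth + f"{self.label}: {self.value}".rstrip(": ")]
        for child in self.children:
            lines.append(child.flatten(depth + 1))
        return "\n".join(lines)

# Usage: parse a scene into the tree once, then reuse the flattened form in
# later generation steps so object identity and style stay explicit.
scene = IUTNode("scene", "kitchen, warm morning light")
mug = scene.add(IUTNode("object", "red ceramic mug"))
mug.add(IUTNode("attribute", "chipped handle"))
print(scene.flatten())
```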
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 19636