Decomposing Visual Histories with Vision-Language Agents: Hierarchical Temporal Guidance for Compositional Image Generation

ACL ARR 2026 May Submission15963 Authors

26 May 2026 (modified: 21 Jun 2026), ACL ARR 2026 May Submission, Readers: Everyone, License: CC BY 4.0
Keywords: multi-reference conditioning, diffusion models, temporal routing, hierarchical guidance
Abstract: Reference-conditioned image generation increasingly relies on visual histories that rarely speak with one voice: one example pins the subject, another fixes a layout, a third only suggests a palette. Current image-conditioned diffusion methods encode these references once, average them into a single vector, and inject it at every denoising step, so conflicting cues collide and the output is faithful to none. We propose the Hierarchical Temporal Guidance Framework (HTGF), a training-free pipeline that reframes multi-reference conditioning as a \emph{temporal routing} problem. A semantic decomposer reads each reference and emits a soft routing distribution over three axes (Subject, Structure, Detail), and a closed-form SNR-sensitivity argument places each axis in the denoising window where it has the most leverage. A short manifold-aware corrector then smooths the trajectory at the stage boundaries where the active condition changes. HTGF adds no training and no architectural change to the diffusion backbone. On three datasets it outperforms strong VLM- and diffusion-based baselines: it gains $+11.7$ CIS over LaVIT on Movie Poster, reaches FID $38.87$ vs.\ $50.12$ for the best baseline under noisy histories, and degrades gracefully down to the zero-reference limit.
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: Personalized image generation, MLLM
Contribution Types: NLP engineering experiment
Languages Studied: English
EMNLP 2026 AI Reviewing Experiment: yes
Submission Number: 15963