Abstract: In complex scene synthesis, the effective representation of layouts is paramount. This paper introduces LayoutEnc, an approach designed to improve the interpretability, robustness, and expressiveness of layout representations, thereby facilitating more faithful layout-to-image generation. Unlike conventional approaches that homogenize layout and image data, LayoutEnc processes each modality separately, enhancing the fidelity and interpretability of the layout representation. We apply stochastic noise injection to image tokens to align training and inference conditions, strengthening the robustness of the layout representation. Additionally, LayoutEnc employs a two-stage multi-scale guidance learning strategy to extract and refine semantic and textural features from training images. The enriched layout representation is then integrated into a transformer-based image generation framework, enabling controlled and nuanced scene synthesis. Experimental results on the COCO-Stuff and Visual Genome datasets demonstrate that LayoutEnc outperforms prior work on FID and Scene-FID. The code and demo are available at https://github.com/qsun1/LayoutEnc.
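The abstract mentions injecting stochastic noise into image tokens so that training-time inputs resemble the imperfect tokens seen at inference. The paper does not specify the exact scheme here; the sketch below illustrates one common variant (uniform random token replacement over a discrete codebook), with the function name, `noise_prob` parameter, and replacement rule all being illustrative assumptions rather than the authors' actual method.

```python
import random

def inject_token_noise(image_tokens, vocab_size, noise_prob=0.1, seed=None):
    """Illustrative sketch: randomly replace a fraction of discrete image
    tokens with tokens drawn uniformly from the codebook, so the model is
    trained on corrupted inputs like those it will see at inference.
    (Hypothetical helper; not the paper's exact scheme.)"""
    rng = random.Random(seed)
    noisy = []
    for tok in image_tokens:
        if rng.random() < noise_prob:
            # Replace this token with a random codebook index.
            noisy.append(rng.randrange(vocab_size))
        else:
            # Keep the original token unchanged.
            noisy.append(tok)
    return noisy
```

With `noise_prob=0` the sequence passes through unchanged; raising it trades clean supervision for robustness to tokenizer errors at inference time.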
External IDs: doi:10.1145/3716389