Keywords: Multi-modal Vision, Large Multimodal Models, Mining of Visual, Multimedia and Multimodal Data
Abstract: Layout design is a fundamental aspect of visual communication, widely used in advertising, publishing, and digital media. Recent datasets and methods, including content-agnostic and content-aware approaches, have advanced automatic layout generation, and large language models (LLMs) and multimodal LLMs (MLLMs) have further improved performance. However, most existing methods focus on predicting bounding boxes for a limited set of design elements on fixed backgrounds, which restricts their ability to tackle diverse instruction-driven tasks in real-world applications. To address these limitations, we introduce **AnyLayout-120K**, a large-scale instruction-driven dataset for multimodal layout generation. It offers: (1) *Task Diversity*: four instruction-driven sub-tasks covering multimodal design elements such as multilingual text, visual/textual products, logos, and background underlays; (2) *Rich Annotations*: user instructions, multimodal inputs, and spatial annotations; (3) *Downstream Compatibility*: in addition to the layouts of individual elements, we propose composite layouts that capture the overall design, integrating both details and semantics; these composite layouts can be seamlessly incorporated into text-to-image (T2I) models for end-to-end generation. Alongside this dataset, we develop seven geometry-aware evaluation metrics that assess spatial precision and adherence to design principles, enabling a more comprehensive evaluation. Furthermore, utilizing this dataset, we establish a strong MLLM-based baseline that achieves state-of-the-art performance. The dataset, metrics, and baseline will be released to support future research in instruction-driven layout design.
Supplementary Material: pdf
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 8785