MiLDEdit: Reasoning-Based Multi-Layer Design Document Editing

ICLR 2026 Conference Submission22439 Authors

20 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0
Keywords: Multi-layer Document Editing, Reasoning-based Document Editing, Multimodal Agent, Benchmark
Abstract: Real-world design documents (e.g., posters) are inherently multi-layered, combining decoration, text, and images. Editing them from natural-language instructions requires fine-grained, layer-aware reasoning to identify relevant layers and coordinate modifications. Prior work largely overlooks multi-layer design document editing, focusing instead on single-layer image editing or multi-layer generation, which assume a flat canvas and lack the reasoning needed to determine what and where to modify. To address this gap, we introduce the Multi-Layer Document Editing Agent (MiLDEAgent), a reasoning-based framework that combines an RL-trained multimodal reasoner for layer-wise understanding with an image editor for targeted modifications. To systematically benchmark this setting, we introduce the Multi-Layer Document Editing Benchmark (MiLDEBench), a human-in-the-loop corpus of over 20K design documents paired with diverse editing instructions. The benchmark covers both content and layout edits and is complemented by a task-specific evaluation protocol, MiLDEEval, which spans four dimensions for content editing (instruction following, layout consistency, aesthetics, and text rendering) and two dimensions for layout editing (instruction following and content consistency). Extensive experiments on 14 open-source and 2 closed-source models reveal that existing approaches fail to generalize: open-source models often cannot complete multi-layer document editing tasks, while closed-source models suffer from instruction misalignment and format violations. In contrast, MiLDEAgent achieves strong layer-aware reasoning and precise editing, improving on all open-source baselines by more than 50% and attaining performance comparable to closed-source models, thereby establishing the first strong baseline for multi-layer document editing.
Primary Area: datasets and benchmarks
Submission Number: 22439