ADOPD-Instruct: A Large-Scale Multimodal Dataset for Document Editing

27 Sept 2024 (modified: 05 Feb 2025), Submitted to ICLR 2025, CC BY 4.0
Keywords: document editing, multimodal dataset, empirical study
TL;DR: We introduce a large-scale multimodal dataset for visually-rich document editing.
Abstract: Visually-rich document editing is a complex multimodal task with a wide range of real-world applications. Despite increasing interest, there is a significant lack of publicly available datasets offering detailed entity-level annotations and step-by-step instructions for the editing process. To address this, we introduce ADOPD-Instruct, a multimodal dataset designed specifically for document editing tasks. ADOPD-Instruct includes visually-rich documents, precise entity-level masks highlighting elements to be edited, and step-by-step edit instructions, targeting both the masking and inpainting processes for text and non-text design elements. ADOPD-Instruct instructions have been carefully curated by human annotators to ensure high quality across the dataset. We conduct extensive evaluations of current Multimodal Large Language Models (MLLMs) and image editing models using various image backbones to assess their performance on document editing. The results reveal substantial challenges: current MLLMs struggle to generate accurate and detailed instructions, while image editing models often fail to follow instructions precisely, particularly with text edits. These findings underscore the limitations of existing models and highlight the importance of annotated datasets like ADOPD-Instruct for advancing this domain. The dataset is available at https://huggingface.co/datasets/adopd-instruct/ADOPD-Instruct.
Supplementary Material: pdf
Primary Area: datasets and benchmarks
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 9379