DIG: Complex Layout Document Image Generation with Authentic-looking Text for Enhancing Layout Analysis

Published: 20 Jul 2024, Last Modified: 21 Jul 2024, Venue: MM2024 Poster, License: CC BY 4.0
Abstract: Although layout analysis of standardized documents has progressed significantly, documents with complex layouts, such as magazines, newspapers, and posters, remain challenging. Models trained on standardized documents struggle with these layouts, and the high cost of annotating such documents limits dataset availability. To address this, we propose the Complex Layout Document Image Generation (DIG) model, which generates diverse document images with complex layouts and authentic-looking text to aid the training of layout analysis models. Concretely, we first pretrain DIG on a large-scale document dataset with a text-sensitive loss function to address the unrealistic generation of text regions. We then fine-tune it on a small number of documents with complex layouts so that it generates new images with the same layouts. Additionally, we use a layout generation model to create new layouts, enhancing data diversity. Finally, we design a box-wise quality scoring function that filters out low-quality regions during layout analysis model training, making the generated images more effective. Experimental results on the DSSE-200 and PRImA datasets show that incorporating images generated by DIG improves the mAP of the layout analysis model from 47.05 to 56.07 and from 53.80 to 62.26, respectively, relative improvements of 19.17% and 15.72% over the baseline.
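The percentage gains quoted in the abstract are relative improvements over the baseline mAP, not absolute differences. As a minimal sketch of that arithmetic, the snippet below recomputes them from the four mAP values given in the abstract. The helper name `relative_gain` and the dictionary layout are illustrative assumptions, not part of the authors' code.

```python
def relative_gain(baseline: float, improved: float) -> float:
    """Relative improvement of `improved` over `baseline`, in percent."""
    return (improved - baseline) / baseline * 100.0


# mAP values quoted in the abstract: (baseline, with DIG-generated images).
reported = {
    "DSSE-200": (47.05, 56.07),
    "PRImA": (53.80, 62.26),
}

# Hypothetical container for the recomputed gains, keyed by dataset name.
gains = {name: relative_gain(b, i) for name, (b, i) in reported.items()}

for name, gain in gains.items():
    print(f"{name}: +{gain:.2f}% relative mAP")
```

Rounded to two decimals, this gives 19.17% for DSSE-200 and 15.72% for PRImA, matching the abstract; the corresponding absolute gains are 9.02 and 8.46 mAP points.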
Primary Subject Area: [Generation] Generative Multimedia
Relevance To Conference: This work contributes to multimedia processing by addressing the challenges posed by documents with complex layouts. Documents, as a form of multimedia, integrate various media types such as text, images, tables, and graphics to convey information effectively. By introducing the Complex Layout Document Image Generation model (DIG), we provide a tailored solution for generating diverse document images with complex layouts and authentic-looking text. DIG bridges the gap between document complexity and the difficulty of obtaining training data, and it enhances layout analysis performance on complex documents when training data is limited. This is particularly crucial for multimedia documents like magazines, newspapers, and posters, which often blend textual and visual elements and are difficult to annotate. Consequently, our method not only addresses the challenges of complex layout documents but also helps advance the extraction and understanding of multimedia information within documents for subsequent applications.
Supplementary Material: zip
Submission Number: 5019