Synthetic Document Generation for Form Understanding with Vision-Language Models

Published: 15 Oct 2025, Last Modified: 31 Oct 2025
Venue: BNAIC/BeNeLearn 2025 Oral
License: CC BY 4.0
Track: Type A (Regular Papers)
Keywords: Synthetic Data, Vision Language Models, Optical Character Recognition, Form Understanding
Abstract: Form understanding remains a persistent challenge for document processing due to the scarcity of annotated training data and the sensitive nature of real-world forms. Vision-Language Models (VLMs) have shown promise in addressing the structural complexity of forms, but their performance is constrained by limited datasets. To address this gap, we propose a Synthetic Document Generation (SDG) pipeline that transforms empty templates into realistic, filled documents through a four-stage process: region detection, content generation, statistical placement modeling, and controlled degradation. The pipeline leverages YOLO-based region detection, large language model–driven content generation, and degradation frameworks such as Augraphy and Albumentations to produce high-fidelity, semantically coherent training data. We evaluate the effectiveness of synthetic data in isolation and in combination with real corpora by fine-tuning the olmOCR VLM on three regimes: real-only, synthetic-only, and hybrid datasets. Results show that synthetic data alone achieves competitive performance, but hybrid training consistently outperforms both real-only and synthetic-only setups, yielding the lowest error rates and highest semantic similarity scores. Comparisons against Tesseract and PP-OCR further demonstrate the advantage of VLMs trained with SDG-augmented data for structured form understanding. These findings confirm synthetic data as a scalable and practical supplement to real datasets, enhancing robustness while reducing reliance on them.
Serve As Reviewer: ~Noah_Scharrenberg1
Submission Number: 37
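
The abstract reports "error rates" without naming the exact metric on this page. As an illustration only, the sketch below assumes a character error rate (CER), a common OCR measure defined as the edit distance between the reference and the predicted text divided by the reference length. The function names `levenshtein` and `character_error_rate` are hypothetical and are not taken from the paper.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Edit distance where insertion, deletion and substitution each cost 1."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty string
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # delete r from the reference
                curr[j - 1] + 1,         # insert h into the reference
                prev[j - 1] + (r != h),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def character_error_rate(ref: str, hyp: str) -> float:
    """CER = edit distance / reference length (assumed metric, see lead-in)."""
    if not ref:
        raise ValueError("reference text must be non-empty")
    return levenshtein(ref, hyp) / len(ref)


if __name__ == "__main__":
    # One substituted character out of four.
    print(character_error_rate("abcd", "abxd"))
```

Under this assumption, comparing the real-only, synthetic-only, and hybrid regimes would amount to averaging such a score over the same held-out set of forms for each fine-tuned model; lower is better.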