SynDoc: A Hybrid Discriminative-Generative Framework for Synthetic Domain-Adaptive Document Key Information Extraction

ICLR 2026 Conference Submission 15443 Authors

19 Sept 2025 (modified: 03 Dec 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Document Understanding, Multimodal Large Language Model, Synthetic Data, Key Information Extraction
Abstract: Domain-specific Visually Rich Document Understanding (VRDU) presents significant challenges due to the complexity and sensitivity of documents in fields such as medicine, finance, and material science. Although existing Large (Multimodal) Language Models (LLMs/MLLMs) achieve promising results, they still suffer from hallucinations, limited domain adaptation, and heavy reliance on extensive fine-tuning datasets. We introduce SynDoc, a novel framework that combines discriminative and generative models to address these challenges. SynDoc features a robust synthetic data generation workflow that extracts structural information and generates domain-specific queries to produce high-quality annotations. Through adaptive instruction tuning, SynDoc improves the discriminative model's ability to extract domain-specific knowledge. In parallel, a recursive inference mechanism iteratively refines outputs from both models to achieve stable and accurate predictions. This integrated framework demonstrates scalable, efficient, and precise document understanding, bridging the gap between domain-specific adaptation and general world knowledge for key information extraction tasks.
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 15443