DocKIE-Bench: Document Key Information Extraction Benchmark for Large Language Models

ACL ARR 2025 May Submission 1824 Authors

18 May 2025 (modified: 03 Jul 2025), ACL ARR 2025 May Submission, CC BY 4.0
Abstract: Document Key Information Extraction (KIE) transforms unstructured or semi-structured documents into structured data, typically key–value pairs or grouped entities, that support enterprise applications such as business workflow automation. Recent work explores using Large Language Models (LLMs) for KIE through prompting rather than document-specific fine-tuning, but progress is hindered by the lack of benchmarks tailored to this emerging paradigm. We introduce DocKIE-Bench, a benchmark specifically designed to evaluate KIE in the context of LLMs. DocKIE-Bench provides carefully designed schemas with detailed descriptions, formats, and examples; covers 38 document types from diverse domains; and includes fine-grained component tags (tables, forms, handwritten regions, and others) that enable nuanced analysis of model performance. We evaluate both proprietary and open-source LLMs and conduct comprehensive ablation studies on schema design and input modality, offering practical insights into current strengths and limitations. The dataset will be made publicly available.
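To make the prompting setup concrete, here is a minimal sketch of how a schema with descriptions, formats, and examples might be rendered into an extraction prompt, and how a model's JSON reply might be mapped back to key–value pairs. Everything in it is an assumption for illustration: the `FieldSpec` class, the `build_prompt` and `parse_response` helpers, the field names, and the prompt wording are hypothetical and are not taken from DocKIE-Bench.

```python
import json
from dataclasses import dataclass


@dataclass
class FieldSpec:
    """One schema entry (hypothetical structure, not the benchmark's own)."""
    name: str
    description: str
    format: str
    example: str


def build_prompt(doc_type, fields, document_text):
    """Render the schema and the document text into a single prompt string."""
    lines = [
        f"Extract the following fields from this {doc_type}.",
        "Return a JSON object with exactly these keys; use null for missing values.",
        "",
    ]
    for f in fields:
        lines.append(
            f"- {f.name}: {f.description} (format: {f.format}; e.g. {f.example})"
        )
    lines += ["", "Document:", document_text]
    return "\n".join(lines)


def parse_response(raw, fields):
    """Keep only schema keys from the model's JSON reply; absent keys become None."""
    data = json.loads(raw)
    return {f.name: data.get(f.name) for f in fields}


# Illustrative schema for an invoice (field names are made up).
invoice_fields = [
    FieldSpec("invoice_number", "Identifier printed on the invoice", "string", "INV-001"),
    FieldSpec("total_amount", "Grand total including tax", "decimal number", "1234.50"),
]

prompt = build_prompt("invoice", invoice_fields, "Invoice INV-001 ...")
result = parse_response('{"invoice_number": "INV-001", "extra": 1}', invoice_fields)
```

In this sketch, keys outside the schema are dropped and missing fields are returned as `None`, so every prediction has the same shape regardless of what the model returns.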
Paper Type: Long
Research Area: Information Extraction
Research Area Keywords: document key information extraction, zero-shot extraction
Contribution Types: Data resources
Languages Studied: English
Submission Number: 1824