See then Tell: Enhancing Key Information Extraction with Vision Grounding

ACL ARR 2025 May Submission 3416 Authors

19 May 2025 (modified: 03 Jul 2025) · ACL ARR 2025 May Submission · CC BY 4.0
Abstract: In the digital era, understanding visually rich documents that combine text, complex layouts, and imagery is crucial. Traditional Key Information Extraction (KIE) methods rely on Optical Character Recognition (OCR), which often incurs latency, computational overhead, and errors. Recent image-to-text approaches bypass OCR but typically yield plain text outputs without vision grounding. In this paper, we introduce STNet ($\textbf{S}$ee then $\textbf{T}$ell Net), an end-to-end model that jointly produces accurate textual answers and their corresponding vision grounding. At the core of STNet lies a novel $\texttt{<see>}$ token, prepended to each response. During generation, $\texttt{<see>}$ directs the model first to $\mathit{see}$ — observing the regions of the image related to the input question (decoded into physical coordinates) — and then to $\mathit{tell}$, emitting the textual answer. To enhance the model's $\mathit{see}$ capabilities, we collect extensive structured table recognition datasets and leverage GPT-4 to develop TVG ($\textbf{T}$ableQA with $\textbf{V}$ision $\textbf{G}$rounding), a dataset of QA pairs annotated with vision grounding. Our approach demonstrates substantial advancements in KIE performance, achieving state-of-the-art results on publicly available datasets such as CORD, SROIE, and DocVQA. The code and dataset will be made publicly available.
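To make the "see then tell" decoding order described above concrete, the sketch below shows how a generated sequence prefixed by a $\texttt{<see>}$ token might be split into a bounding box and a textual answer. The token syntax, coordinate encoding, and function names here are illustrative assumptions, not the paper's actual specification.

```python
import re

# Hypothetical "see then tell" output: a <see> token whose payload decodes to
# physical coordinates, followed by the textual answer. The exact format is an
# assumption for illustration only.
SEE_PATTERN = re.compile(
    r"<see>\s*\[(\d+),\s*(\d+),\s*(\d+),\s*(\d+)\]\s*(.*)", re.DOTALL
)

def parse_see_then_tell(generated: str):
    """Split a generated sequence into (bounding box, answer text)."""
    match = SEE_PATTERN.match(generated.strip())
    if match is None:
        # No grounding emitted; fall back to plain text.
        return None, generated.strip()
    x1, y1, x2, y2 = (int(match.group(i)) for i in range(1, 5))
    answer = match.group(5).strip()
    return (x1, y1, x2, y2), answer

if __name__ == "__main__":
    # Example output a grounded KIE model might emit for a receipt question.
    sample = "<see> [112, 340, 298, 372] TOTAL: 23.50"
    box, answer = parse_see_then_tell(sample)
    print(box, answer)  # (112, 340, 298, 372) TOTAL: 23.50
```

In this reading, the grounding region is decoded before the answer text, so the answer is conditioned on the attended image region rather than produced as ungrounded plain text.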
Paper Type: Long
Research Area: Information Extraction
Research Area Keywords: named entity recognition and relation extraction, zero/few-shot extraction
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English, Chinese
Submission Number: 3416