Track: tiny paper (up to 4 pages)
Keywords: Document intelligence, OCR, vision–language models, document parsing, visual question answering, benchmarking
TL;DR: OCR pipelines work best for long, text-heavy documents, while VLMs excel at visually rich content such as infographics; we show this by evaluating text extraction and question answering separately rather than measuring only end-to-end accuracy.
Abstract: Document intelligence requires accurate text extraction as well as reliable reasoning over document content. We introduce \textbf{DISCO}, a \emph{Document Intelligence Suite for Comparative Evaluation}, which evaluates optical character recognition (OCR) pipelines and vision-language models (VLMs) separately on parsing and question answering across diverse document types, including handwritten text, multilingual scripts, medical forms, infographics, and multi-page documents. Our evaluation shows that performance varies substantially across tasks and document characteristics, underscoring the need to select an approach according to document complexity. OCR pipelines are generally more reliable for handwriting and for long or multi-page documents, where explicit text grounding supports text-heavy reasoning, while VLMs perform better on multilingual text and visually rich layouts. Task-aware prompting yields mixed effects, improving performance on some document types while degrading it on others. These findings provide empirical guidance for selecting document processing strategies based on document structure and reasoning demands.
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 23