Keywords: Document Understanding, Multi-modal Models, Language Models, NLP, Multimodal Data, Key Information Extraction, Question Answering, Information Extraction, Table Comprehension, KIE, NLI, Visual QA, Layout-aware Language Models
TL;DR: Description of a benchmark spanning multiple end-to-end tasks related to understanding multi-modal documents with complex layouts.
Abstract: Understanding documents with rich layouts plays a vital role in digitization and hyper-automation but remains a challenging topic in the NLP research community. Moreover, the lack of a commonly accepted benchmark has made it difficult to quantify progress in the domain. To empower research in this field, we introduce the Document Understanding Evaluation (DUE) benchmark, which consists of both available and reformulated datasets and measures the end-to-end capabilities of systems in real-world scenarios. The benchmark includes Visual Question Answering, Key Information Extraction, and Machine Reading Comprehension tasks over various document domains and layouts featuring tables, graphs, lists, and infographics. In addition, this study reports systematic baselines and analyzes challenges in currently available datasets using recent advances in layout-aware language modeling. We release both the benchmarks and the reference implementations at https://duebenchmark.com and https://github.com/due-benchmark.
Supplementary Material: pdf
Contribution Process Agreement: Yes
Dataset Url: https://duebenchmark.com
License: MIT License
Author Statement: Yes