DOLMA: Visual Instruction Tuning for Document AI

ACL ARR 2024 June Submission5586 Authors

16 Jun 2024 (modified: 02 Jul 2024) · ACL ARR 2024 June Submission · License: CC BY 4.0
Abstract: The rapid expansion of Vision-Language Models (VLMs) has spurred research into their applicability across various domains. While VLMs excel at understanding natural visual contexts, their effectiveness declines on visually rich scanned documents. Although some VLMs use Optical Character Recognition (OCR) to mitigate this, OCR alone is insufficient to capture the complex textual and visual insights these documents require. Developing tailored models for Document AI applications also demands substantial labeled data and high training costs. To address these challenges, we conducted experiments with various models, data types, architectures, and training methodologies. Based on our findings, we introduce DOLMA, an OCR-free vision-language model designed for diverse Document AI applications in a zero-shot setting. Despite having a moderate parameter count of 7 billion, DOLMA performs on par with models ten times larger on numerous Document AI benchmarks. The complete model, including weights, training data, and code, is publicly available.
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: Document AI, VRDU, KIE, LLM, OCR, vision-and-language, multimodal
Contribution Types: Model analysis & interpretability, Approaches to low-resource settings, Approaches to low-compute settings (efficiency), Publicly available software and/or pre-trained models, Data resources
Languages Studied: English
Submission Number: 5586