InstructOCR2: A Lightweight and Efficient Multi-modal Language Model for Document Understanding

ACL ARR 2025 February Submission 5186 Authors

16 Feb 2025 (modified: 09 May 2025) · ACL ARR 2025 February Submission · Readers: Everyone · License: CC BY 4.0
Abstract: In recent years, there has been significant interest in applying Multi-modal Large Language Models (MLLMs) to OCR tasks, leading to the development of MLLMs designed specifically for the OCR domain. Most existing approaches focus on building larger and more sophisticated models, which demand substantial computational resources for training and deployment. Furthermore, these methods often fail to achieve effective alignment between text and its corresponding positions within the image. Some approaches simply feed all text directly into the model, while others, despite incorporating coordinate information, still fall short in capturing the precise location and contextual relationships of text within images. In this paper, we propose a lightweight multi-modal language model called InstructOCR2, which handles multi-scene, multi-task OCR with fewer parameters. InstructOCR2 enhances the model's comprehension of global and local text through fine-grained alignment between text and images, thereby improving performance on downstream tasks such as Visual Question Answering (VQA) and Key Information Extraction (KIE).
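The abstract does not specify how the fine-grained text-image alignment is implemented. Purely as an illustrative sketch, and not the authors' method, one common way to align OCR text spans with their image regions is a region-level contrastive (InfoNCE-style) objective; the module names, feature dimensions, and use of PyTorch below are all assumptions for illustration.

```python
# Illustrative sketch only -- NOT the InstructOCR2 implementation.
# Shows a generic region-level contrastive alignment between cropped
# text-region features and the embeddings of their OCR text spans.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RegionTextAligner(nn.Module):
    """Projects region crops and text-span embeddings into a shared space."""
    def __init__(self, img_dim=768, txt_dim=768, proj_dim=256):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, proj_dim)   # hypothetical region-encoder output -> shared space
        self.txt_proj = nn.Linear(txt_dim, proj_dim)   # hypothetical text-encoder output -> shared space
        self.logit_scale = nn.Parameter(torch.tensor(2.659))  # log(1/0.07), CLIP-style temperature

    def forward(self, region_feats, text_feats):
        # region_feats: (N, img_dim) features of N cropped text regions
        # text_feats:   (N, txt_dim) embeddings of the N corresponding text spans
        img = F.normalize(self.img_proj(region_feats), dim=-1)
        txt = F.normalize(self.txt_proj(text_feats), dim=-1)
        logits = self.logit_scale.exp() * img @ txt.t()       # (N, N) similarity matrix
        targets = torch.arange(len(img), device=img.device)   # i-th region matches i-th span
        # Symmetric InfoNCE: region->text and text->region cross-entropy
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))

# Usage with random tensors standing in for encoder outputs
aligner = RegionTextAligner()
loss = aligner(torch.randn(8, 768), torch.randn(8, 768))
loss.backward()
```

A loss of this kind ties each recognized text span to its spatial region, which is one plausible route to the "global and local" text comprehension the abstract describes; the actual objective used in the paper may differ.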
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: OCR, MLLMs, text-rich image understanding
Contribution Types: Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 5186