Enhanced Visual Instruction Tuning for Text-Rich Image Understanding

Anonymous

16 Feb 2024 · ACL ARR 2024 February Blind Submission · Readers: Everyone
Abstract: Instruction tuning enhances the capability of Large Language Models (LLMs) to interact with humans. Moreover, recent instruction-following datasets include images as visual input and collect responses to image-based instructions. However, current visual instruction-tuned models do not comprehend textual details within images well. This work enhances current open-source visual instruction tuning models with text-rich images (e.g., movie posters and book covers). Specifically, we first use publicly available OCR tools to collect results on 422K text-rich images from the LAION dataset. We then prompt text-only GPT-4 with the recognized text and image captions to generate 16K conversations, each containing question-answer pairs about a text-rich image. Using the collected data, we substantially improve the zero-shot capability of two open-source backbone models on seven datasets (text-based VQA, information extraction, ChartQA, etc.), with accuracy gains of up to 20%. A GPT-4-based instruction-following evaluation also shows that our model improves on both natural images and text-rich images. We will make our code, data, and models publicly available.
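
The abstract describes a two-step data pipeline: run OCR over text-rich images, then prompt text-only GPT-4 with the recognized text and the image caption to generate conversational question-answer pairs. The snippet below is a minimal sketch of that flow, not the authors' released code; pytesseract stands in for the unspecified "publicly available OCR tools", the OpenAI chat API stands in for text-only GPT-4 prompting, and the example file name and caption are hypothetical.

```python
# Minimal sketch of the described pipeline (illustrative only, not the authors' code).
from PIL import Image
import pytesseract                # assumption: stand-in for the paper's OCR tools
from openai import OpenAI

client = OpenAI()                 # assumes OPENAI_API_KEY is set in the environment


def ocr_image(image_path: str) -> str:
    """Recognize text in a text-rich image (e.g., a movie poster or book cover)."""
    return pytesseract.image_to_string(Image.open(image_path)).strip()


def generate_conversation(ocr_text: str, caption: str) -> str:
    """Prompt a text-only LLM with OCR results and the image caption to
    produce question-answer pairs grounded in the image's textual content."""
    prompt = (
        "You are given the caption and OCR-recognized text of an image.\n"
        f"Caption: {caption}\n"
        f"OCR text: {ocr_text}\n"
        "Write a short conversation of question-answer pairs about the textual "
        "content of the image, phrased as if the assistant can see the image."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


if __name__ == "__main__":
    # Hypothetical inputs; in the paper these come from LAION images and captions.
    text = ocr_image("movie_poster.jpg")
    print(generate_conversation(text, "A movie poster with bold red lettering"))
```

In this sketch the conversation is returned as free-form text; in practice one would parse it into structured question-answer turns before using it for instruction tuning.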
Paper Type: long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Contribution Types: Data resources
Languages Studied: English