Addressing Illiteracy of Vision-Language Model in Underrepresented Language Through Image-Text Mix Augmentation Scheme

Published: 01 Jan 2025, Last Modified: 06 Nov 2025 · AVSS 2025 · CC BY-SA 4.0
Abstract: Recently, open-source large Vision-Language Models (VLMs) have progressed toward performance comparable to closed-source VLMs. However, open-source VLMs struggle to recognize unfamiliar text depicted in images when the text is written in underrepresented languages such as Korean. This illiteracy problem stems primarily from insufficient training data for these languages. To address it, we propose a novel augmentation scheme that generates large-scale image data for underrepresented languages with minimal manual annotation. Our scheme synthetically combines a text image depicting words or sentences with a template image containing a textual context, such as a receipt, a sign, a book, or a product label. Specifically, the text image is cut and pasted into a patch of the template image to generate a synthetic image, which is labeled with the corresponding text from the text image. Fine-tuning a VLM on this synthetic data therefore enhances its ability to generalize to real-world text recognition tasks. Experimental results demonstrate the effectiveness of our scheme, showing a significant improvement in text recognition performance.
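The following is a minimal sketch of the cut-and-paste idea described above, not the authors' implementation. It assumes a Pillow-based pipeline; the function names (`render_text_image`, `paste_text_into_template`), the font file `NanumGothic.ttf`, the template file `receipt_template.jpg`, and the example patch coordinates are all illustrative placeholders.

```python
from PIL import Image, ImageDraw, ImageFont


def render_text_image(text, font_path, font_size=32):
    """Render a word or sentence onto a plain background (the 'text image')."""
    font = ImageFont.truetype(font_path, font_size)
    # Measure the rendered text to size the canvas with a small margin.
    bbox = font.getbbox(text)
    w, h = bbox[2] - bbox[0], bbox[3] - bbox[1]
    canvas = Image.new("RGB", (w + 20, h + 20), "white")
    ImageDraw.Draw(canvas).text((10, 10 - bbox[1]), text, font=font, fill="black")
    return canvas


def paste_text_into_template(text_img, template_img, patch_box):
    """Cut and paste the text image into a patch of the template image.

    patch_box is an (x0, y0, x1, y1) region of the template (e.g. the blank
    area of a receipt or sign) where the rendered text should appear.
    """
    x0, y0, x1, y1 = patch_box
    resized = text_img.resize((x1 - x0, y1 - y0))
    synthetic = template_img.copy()
    synthetic.paste(resized, (x0, y0))
    return synthetic


# Example: build one (image, label) training pair for a Korean phrase.
korean_text = "영수증 합계 12,000원"
text_img = render_text_image(korean_text, font_path="NanumGothic.ttf")
template = Image.open("receipt_template.jpg")
sample = {
    "image": paste_text_into_template(text_img, template, patch_box=(40, 120, 400, 170)),
    "label": korean_text,  # the synthetic image is labeled with the pasted text
}
```

In this sketch, scaling the data up amounts to looping over a text corpus, a pool of template images, and randomized patch locations, yielding labeled pairs for fine-tuning without per-image manual annotation.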