KoTextVQA: A Benchmark for Understanding and Reasoning in Korean Text-Rich Visual Question Answering
Abstract: In real-world scenarios, text in images conveys essential information, appearing in documents, everyday scenes, and digital displays.
Accurately interpreting text and its visual context is a key objective for Vision-Language Models (VLMs), driving advancements in text-rich Visual Question Answering (VQA) datasets and benchmarks. However, low-resource languages remain underexplored and lack appropriate benchmarks for real-world applications. Without such benchmarks as milestones, systematic evaluation becomes difficult, slowing iterative improvements in model performance and the refinement of fine-tuning strategies. To address this, we introduce KoTextVQA, a Korean text-rich VQA benchmark for comprehensive VLM evaluation. KoTextVQA enables multifaceted assessment across diverse image types and domains, while also supporting in-depth analysis through comparisons between visual understanding (System 1) and reasoning (System 2) abilities. Additionally, we release an automated VQA generation pipeline that leverages language models well trained on resource-rich languages to efficiently construct benchmarks, facilitating scalable and high-quality dataset creation. While our benchmark is designed for Korean, the proposed methodology is highly adaptable and can be extended to other languages, enabling broader multilingual VLM research.
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: Multimodality and Language Grounding to Vision, Robotics and Beyond, Multilingualism and Cross-Lingual NLP, Question Answering
Contribution Types: Approaches to low-resource settings, Data resources, Data analysis
Languages Studied: Korean, English
Submission Number: 4485