Keywords: Multimodality and Language Grounding to Vision, Resources and Evaluation, Question Answering
Abstract: Vision-language models (VLMs) often struggle with culturally specific content, a challenge largely overlooked by existing benchmarks that focus on dominant languages and globalized datasets. We introduce TᴀɪᴡᴀɴVQA, a VQA benchmark grounded in Taiwanese culture and designed to evaluate recognition and reasoning in regional contexts. TᴀɪᴡᴀɴVQA contains 2,736 images and 5,472 manually curated questions covering topics such as traditional foods, public signs, festivals, and landmarks. The official benchmark set comprises 1,000 images and 2,000 questions for systematic assessment, with the remainder of the data used as training material. Evaluations of state-of-the-art VLMs reveal strong visual recognition but notable weaknesses in cultural reasoning. To address this, we propose a data augmentation strategy that combines human-annotated and synthesized dialogues to enhance cultural understanding. Fine-tuning yields significant gains on TᴀɪᴡᴀɴVQA while maintaining stable performance on other multimodal tasks. To further probe the models’ cultural understanding, we conduct an open-ended question-answering experiment; the results indicate a notable decline in cultural knowledge generation ($\approx$10–20\%), suggesting that challenges remain. TᴀɪᴡᴀɴVQA offers a scalable framework for building culturally grounded AI models for low-resource cultures, promoting diversity and fairness in multimodal AI. Our dataset and code are publicly available on [Hugging Face](https://huggingface.co/datasets/hhhuang/TaiwanVQA) and [GitHub](https://github.com/hhhuang/TaiwanVQA).
Croissant File: json
Dataset URL: https://huggingface.co/datasets/hhhuang/TaiwanVQA
Code URL: https://github.com/hhhuang/TaiwanVQA
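Since the dataset is hosted on the Hugging Face Hub (URL above), a minimal sketch of loading it with the standard `datasets` library is shown below; the split and column names are not specified on this page, so the code only inspects whatever splits and fields the repository actually exposes.

```python
# Minimal sketch: load TaiwanVQA from the Hugging Face Hub.
# Assumes the `datasets` library; split/column names are discovered at
# runtime rather than assumed, since this page does not list them.
from datasets import load_dataset

ds = load_dataset("hhhuang/TaiwanVQA")  # dataset URL given above
print(ds)  # shows the available splits and their columns

# Peek at the field types of a few examples from the first available split.
split_name = next(iter(ds))
for example in list(ds[split_name])[:3]:
    print({key: type(value).__name__ for key, value in example.items()})
```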
Primary Area: Datasets & Benchmarks for applications in language modeling and vision language modeling
Submission Number: 1740