Lost in Translation: A Position Paper on Probing Cultural Bias in Vision-Language Models via Hanbok VQA
Keywords: Vision-Language Models, Visual Question Answering, Cultural Bias
Abstract: While Vision-Language Models (VLMs) offer transformative potential for cultural heritage preservation, they often exhibit significant ``cultural blind spots'' because their training data is heavily skewed towards Western contexts, limiting their understanding of non-Western cultures such as those of East Asia. This paper posits that modern VLMs consequently fail to accurately interpret underrepresented cultural objects, leading to misidentification, cultural confusion, and factual hallucination. To investigate this, we evaluate prominent VLMs, including LLaVA-1.5, ViP-LLaVA, Shikra, and MiniGPT-4, on a newly curated, culturally rich Visual Question Answering (VQA) dataset focused on traditional Korean attire, Hanbok. Our experimental results show that these models not only achieve low accuracy but also exhibit systematic error patterns indicative of a deeper lack of cultural understanding. Beyond diagnosing this deficiency, we propose a methodological refinement: the adoption of `thick' evaluation frameworks that move beyond superficial accuracy metrics to explicitly assess nuanced cultural understanding and alignment. We further advocate Multimodal Retrieval-Augmented Generation (MRAG) as an enhanced architectural paradigm that grounds models in verifiable, culturally contextualized, and community-curated knowledge, addressing fundamental shortcomings of existing methods. This work provides empirical evidence of the cultural limitations inherent in current VLMs and charts a research agenda toward more equitable and culturally respectful AI for global digital heritage.
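To make the proposed MRAG direction concrete, the sketch below illustrates one way a VLM's answer could be conditioned on retrieved, community-curated Hanbok knowledge. The knowledge entries, the embedding function, and the `vlm_generate` call are illustrative placeholders under assumed interfaces, not the paper's actual implementation.

```python
# Minimal sketch of a Multimodal RAG (MRAG) loop for culturally grounded VQA.
# All components below are placeholders (assumptions), not the paper's system.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical community-curated knowledge base: short Hanbok facts paired with
# precomputed text embeddings (random vectors here purely for illustration).
KNOWLEDGE_BASE = [
    {"text": "The jeogori is the upper garment of the Hanbok, tied with a goreum ribbon.",
     "embedding": rng.normal(size=384)},
    {"text": "The chima is the full, high-waisted skirt worn with the jeogori.",
     "embedding": rng.normal(size=384)},
    {"text": "A norigae is a decorative pendant hung from the goreum or chima.",
     "embedding": rng.normal(size=384)},
]

def embed_query(image, question: str) -> np.ndarray:
    """Placeholder multimodal encoder; a real system would use a joint
    image-text embedding model. Returns a dummy vector here."""
    return rng.normal(size=384)

def retrieve(query_vec: np.ndarray, k: int = 2) -> list[str]:
    """Return the k knowledge snippets with highest cosine similarity."""
    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
    ranked = sorted(KNOWLEDGE_BASE,
                    key=lambda entry: cosine(query_vec, entry["embedding"]),
                    reverse=True)
    return [entry["text"] for entry in ranked[:k]]

def vlm_generate(image, prompt: str) -> str:
    """Placeholder for a VLM call (e.g., LLaVA-1.5); returns a stub answer."""
    return "[VLM answer grounded in the retrieved context]"

def mrag_answer(image, question: str) -> str:
    """Retrieve cultural context first, then condition the VLM on it."""
    context = retrieve(embed_query(image, question))
    prompt = ("Use the following verified cultural context to answer.\n"
              + "\n".join(f"- {c}" for c in context)
              + f"\nQuestion: {question}\nAnswer:")
    return vlm_generate(image, prompt)

if __name__ == "__main__":
    print(mrag_answer(image=None, question="What is the ribbon on this Hanbok called?"))
```

The design point of the sketch is that the model never answers from parametric memory alone: every response is conditioned on retrievable, verifiable cultural context, which is what distinguishes the MRAG paradigm from the baseline VLMs evaluated in the paper.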
Submission Number: 10