Seeing Culture: A Benchmark for Visual Reasoning and Grounding

ACL ARR 2025 May Submission7665 Authors

20 May 2025 (modified: 03 Jul 2025) · ACL ARR 2025 May Submission · CC BY 4.0
Abstract: Multimodal vision-language models (VLMs) have made substantial progress on tasks that demand a combined understanding of visual and textual content, and the emergence of new cultural datasets has extended this progress to cultural understanding. However, these datasets frequently fall short of eliciting cultural reasoning and underrepresent many cultures. In this paper, we introduce the Seeing Culture Benchmark (SCB), which focuses on cultural reasoning through a novel approach that requires VLMs to reason over culturally rich images in two stages: i) selecting the correct visual option in a multiple-choice visual question answering (VQA) setting, and ii) segmenting the relevant cultural artifact as evidence of reasoning. Visual options in the first stage are systematically organized into three types: options originating from the same country, from different countries, or from a mixed group; in every case, all options are drawn from a single category. Progression to the second stage occurs only after a correct visual option is chosen. Our benchmark comprises 1,065 images capturing 138 cultural artifacts across five categories from seven Southeast Asian (SEA) countries, whose diverse cultures are often overlooked. It also provides 3,178 questions, of which 1,093 are unique and meticulously curated by human annotators. Our evaluation of various VLMs reveals the complexity of cross-modal cultural reasoning and underscores the disparity between visual reasoning and spatial grounding in culturally nuanced scenarios. The SCB serves as a crucial benchmark for identifying these shortcomings and guiding future developments in cultural reasoning.
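To make the two-stage protocol described in the abstract concrete, below is a minimal evaluation sketch. It assumes a hypothetical model interface (`answer_multiple_choice`, `segment_artifact`), hypothetical field names for each benchmark example, and IoU as the stage-two segmentation metric; none of these specifics are given in the paper, and the actual SCB evaluation may differ.

```python
import numpy as np

def evaluate_two_stage(examples, model):
    """Hypothetical two-stage scorer: stage 1 is multiple-choice VQA accuracy;
    stage 2 (segmentation IoU) is scored only when stage 1 is answered correctly."""
    stage1_correct = 0
    stage2_ious = []
    for ex in examples:
        # Stage 1: the model selects one of the visual options (an index).
        choice = model.answer_multiple_choice(ex["question"], ex["image_options"])
        if choice != ex["correct_option"]:
            continue  # wrong choice: no progression to stage 2
        stage1_correct += 1
        # Stage 2: the model segments the cultural artifact in the chosen image.
        pred_mask = model.segment_artifact(ex["image_options"][choice], ex["question"])
        gt_mask = ex["gt_mask"]  # binary ground-truth mask (assumed format)
        inter = np.logical_and(pred_mask, gt_mask).sum()
        union = np.logical_or(pred_mask, gt_mask).sum()
        stage2_ious.append(inter / union if union > 0 else 0.0)
    n = len(examples)
    return {
        "stage1_accuracy": stage1_correct / n if n else 0.0,
        "stage2_mean_iou": float(np.mean(stage2_ious)) if stage2_ious else 0.0,
    }
```

The gating step (skipping stage 2 on an incorrect choice) mirrors the abstract's requirement that segmentation serve as evidence of a correct cultural inference rather than a standalone task.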
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: multimodal QA, visual question answering, cross-modal application, benchmarking
Contribution Types: Data resources, Data analysis
Languages Studied: English
Submission Number: 7665