Keywords: sociocultural-awareness; dialogue generation
Abstract: Multimodal large language models (MLLMs) have revolutionized many applications but still face challenges related to cultural bias and a lack of the cultural commonsense knowledge that is crucial for guiding cross-cultural communication and interaction. In particular, prior studies in the cultural domain largely overlook the fine-grained situational context that reflects the diverse and rich cultures across the world. To bridge this gap, we introduce an approach for massively multicultural MLLM knowledge acquisition at the level of fine-grained social awareness. First, we construct a new dataset, NormLens, for benchmarking sociocultural norm-aware reasoning in the underlying LLM backbones. We extract and curate 42,000 culturally grounded assertions from Wikipedia, spanning 1,000+ sub-country regions and 2,000+ ethnolinguistic groups, with automated cleaning to obtain self-contained sentences and fine-grained cultural profile extraction. Building on this, we propose a framework for multimodal cultural knowledge acquisition, MM-ACE (Multi-Modal Alignment with Cultural Enhancement), based on scalable finetuning on contrastive (norm, dialogue, image) triplets. Experiments demonstrate that MM-ACE improves cultural norm violation detection by 7.5% F-score over baselines, with particularly strong gains on fine-grained situational understanding tasks in our manually curated gold standard test set.
Submission Number: 23