VideoNorms: Benchmarking Socio-Cultural Norm Understanding of Video Language Models

ACL ARR 2026 January Submission 10612 Authors

06 Jan 2026 (modified: 20 Mar 2026), ACL ARR 2026 January Submission, License: CC BY 4.0
Keywords: cultural norms, social norms, video understanding, video llm
Abstract: As Video Large Language Models (VideoLLMs) are deployed globally, they require an understanding of and grounding in the relevant cultural background. We introduce VideoNorms, the first benchmark to assess the cultural norm competence of VideoLLMs, consisting of over 1,000 (video clip, norm) pairs from US and Chinese cultures annotated with socio-cultural norms, adherence and violation labels, and verbal and non-verbal evidence. We benchmark a variety of open-weight VideoLLMs on the new dataset, which highlights several common trends: 1) models perform worse on norm violation than on adherence; 2) models perform worse on Chinese culture than on US culture; 3) models have more difficulty providing non-verbal evidence than verbal evidence for the norm adherence/violation label; and 4) unlike humans, models perform worse in formal, non-humorous contexts. Our findings emphasize the need for culturally grounded video language model training, a gap our benchmark and framework begin to address.
Paper Type: Long
Research Area: Computational Social Science, Cultural Analytics, and NLP for Social Good
Research Area Keywords: human behavior analysis, language/cultural bias analysis, sociolinguistics, NLP tools for social analysis
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data resources, Data analysis
Languages Studied: English, Mandarin
Submission Number: 10612