VideoNorms: Benchmarking Socio-Cultural Norm Understanding of Video Language Models

ACL ARR 2026 January Submission 10612 Authors

06 Jan 2026 (modified: 20 Mar 2026), ACL ARR 2026 January Submission, License: CC BY 4.0
Keywords: cultural norms, social norms, video understanding, video llm
Abstract: As Video Large Language Models (VideoLLMs) are deployed globally, they require an understanding of and grounding in the relevant cultural background. We introduce VideoNorms, the first benchmark to assess the cultural norm competence of VideoLLMs, consisting of over 1,000 (video clip, norm) pairs from US and Chinese cultures annotated with socio-cultural norms, adherence and violation labels, and verbal and non-verbal evidence. We benchmark a variety of open-weight VideoLLMs on the new dataset, which highlights several common trends: 1) models perform worse on norm violation than on adherence; 2) models perform worse on Chinese culture than on US culture; 3) models have more difficulty providing non-verbal evidence than verbal evidence for the norm adherence/violation label; and 4) unlike humans, models perform worse in formal, non-humorous contexts. Our findings emphasize the need for culturally grounded video language model training, a gap our benchmark and framework begin to address.
Paper Type: Long
Research Area: Computational Social Science, Cultural Analytics, and NLP for Social Good
Research Area Keywords: human behavior analysis, language/cultural bias analysis, sociolinguistics, NLP tools for social analysis
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data resources, Data analysis
Languages Studied: English, Mandarin
Submission Number: 10612