Read the Room: Video Social Reasoning with Mental-Physical Causal Chains

ICLR 2026 Conference Submission 15054 Authors

19 Sept 2025 (modified: 03 Dec 2025), ICLR 2026 Conference Submission, CC BY 4.0
Keywords: video question answering, social reasoning, theory of mind, causal chains, read the room
Abstract: ``Read the room,'' or the ability to infer others' mental states from subtle social cues, is a hallmark of human social intelligence but remains a major challenge for current AI systems. Existing social reasoning datasets are limited in complexity, scale, and coverage of mental states, falling short of the rich causal dynamics found in real-life interactions. In this work, we introduce $R^3$-Bench-an evaluation benchmark with fine-grained annotations of belief, intent, desire, emotion, and their causal chains in complex scenarios; and $R^3$-FDT, a large-scale training set generated through a novel automated pipeline with the same structure. We conduct a comprehensive evaluation of state-of-the-art (SOTA) large vision-language models (LVLMs) on $R^3$-Bench, revealing substantial gaps in consistent multi-step social reasoning. We also fine-tune a 7B model using group relative policy optimization (GRPO) on $R^3$-FDT, achieving notable improvements across multiple social reasoning benchmarks. Our contributions are three-fold: (i) a novel benchmark with richly annotated, multi-step causal reasoning data; (ii) systematic evidence that SOTA LVLMs fall far short of human-level reasoning; (iii) a scalable training dataset that significantly enhances social reasoning performance. We will release our dataset, code and models upon acceptance.
Supplementary Material: pdf
Primary Area: applications to neuroscience & cognitive science
Submission Number: 15054