Read the Room: Video Social Reasoning with Mental-Physical Causal Chains

Published: 26 Jan 2026, Last Modified: 06 Mar 2026 · ICLR 2026 Poster · CC BY 4.0
Keywords: video question answering, social reasoning, theory of mind, causal chains, read the room
Abstract: "Read the room", or the ability to infer others' mental states from subtle social cues, is a hallmark of human social intelligence, but remains a major challenge for current AI systems. Existing social reasoning datasets are limited in complexity, scale, and coverage of mental states, falling short of the rich causal dynamics found in real-life interactions. In this work, we introduce R$^3$-Bench, an evaluation benchmark with fine-grained annotations of belief, intent, desire, emotion, and their causal chains in complex scenarios. Furthermore, we introduce R$^3$-FDT, a large-scale training set generated through a novel automated pipeline with the same chain structure. We conduct a comprehensive evaluation of state-of-the-art (SOTA) large vision-language models (LVLMs) on R$^3$-Bench, revealing substantial deficiencies in consistent multi-step social reasoning. We also fine-tune a 7B model on R$^3$-FDT, achieving notable improvements across multiple relevant benchmarks. Our contributions are three-fold: (i) a novel benchmark with richly annotated, multi-step causal reasoning data; (ii) systematic evidence that SOTA LVLMs fall far short of human-level reasoning; (iii) a scalable training dataset that significantly enhances social reasoning performance. The datasets and code are available at: <https://github.com/LiXingNiu/Read-the-Room.git>.
Supplementary Material: pdf
Primary Area: applications to neuroscience & cognitive science
Submission Number: 15054