Keywords: video question answering, social reasoning, theory of mind, causal chains, read the room
Abstract: The ability to "read the room," that is, to infer others' mental states from subtle social cues, is a hallmark of human social intelligence but remains a major challenge for current AI systems. Existing social reasoning datasets are limited in complexity, scale, and coverage of mental states, falling short of the rich causal dynamics found in real-life interactions. In this work, we introduce Read-the-Room Reasoning for Video Question Answering (R3-VQA), a high-quality, comprehensive video dataset designed to advance social reasoning in large vision-language models (LVLMs). It comprises two parts: R3-VQA-Challenge, an evaluation benchmark with fine-grained annotations of belief, intent, desire, emotion, and their causal chains in complex scenarios; and R3-VQA-Training, a large-scale training set with the same annotation structure, generated through a novel automated pipeline. We conduct a comprehensive evaluation of SOTA LVLMs on R3-VQA-Challenge, revealing substantial gaps in consistent multi-step social reasoning. We also fine-tune a 7B model on R3-VQA-Training using group relative policy optimization (GRPO), achieving notable improvements across multiple social reasoning benchmarks. Our contributions are threefold: (i) a novel benchmark with richly annotated, multi-step causal reasoning data; (ii) systematic evidence that SOTA LVLMs fall far short of human-level reasoning; and (iii) a scalable training dataset that significantly enhances social reasoning performance.
Supplementary Material: pdf
Primary Area: applications to neuroscience & cognitive science
Submission Number: 15054