Social Human Robot Embodied Conversation (SHREC) Dataset: Benchmarking Foundational Models’ Social Reasoning

Authors: ICLR 2026 Conference Submission 13226 Authors

Published: 18 Sept 2025 (modified: 08 Oct 2025) · License: CC BY 4.0
Keywords: Social Intelligence, Large Language Models, Foundational Models, Vision-Language Models, Robotics, Social Robots, Social Interactions
TL;DR: We present SHREC, a large-scale benchmark of real-world human–robot interactions designed to evaluate and advance AI models’ social reasoning.
Abstract: Our work focuses on the social reasoning capabilities of foundational models for real-world human–robot interactions. We introduce the Social Human Robot Embodied Conversation (SHREC) Dataset, a large-scale benchmark of 400 real-world human–robot interaction videos with over 10K annotations capturing robot social errors, competencies, underlying rationales, and corrections. Unlike prior datasets focused on human–human interactions, the SHREC Dataset uniquely highlights the challenges faced by real-world embodied social AI agents, where robots lack innate social abilities such as emotion understanding, intention tracking, and conversational mechanics. Moreover, current foundational models struggle to recognize these deficits, which manifest as subtle, socially situated failures. To evaluate AI models' capacity for social reasoning, we define eight benchmark tasks targeting four critical areas: (1) detection of social errors and competencies, (2) identification of underlying social attributes, (3) comprehension of interaction flow, and (4) generation of rationales and alternative correct actions. Experiments with state-of-the-art foundational models, alongside human evaluations, reveal substantial performance gaps, underscoring the difficulty of the benchmark and pointing to directions for developing socially intelligent AI.
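To make the annotation structure described above concrete, the sketch below models one plausible record format for a SHREC-style video annotation, mapping each field to the four benchmark areas. This is a minimal illustration only; all class and field names are hypothetical assumptions, not the dataset's actual schema or label vocabulary.

```python
from dataclasses import dataclass, field
from typing import Literal, Optional

# Hypothetical schema sketch: the real SHREC release may use different
# field names, file formats, and label vocabularies.

@dataclass
class SocialAnnotation:
    """One annotated moment in a human-robot interaction video."""
    start_sec: float                      # start of the annotated segment
    end_sec: float                        # end of the annotated segment
    kind: Literal["error", "competency"]  # area (1): detection label
    attribute: str                        # area (2): e.g. "emotion understanding"
    rationale: str                        # area (4): why the behavior is right/wrong
    correction: Optional[str] = None      # area (4): alternative correct action

@dataclass
class InteractionVideo:
    video_id: str
    transcript: list[str] = field(default_factory=list)  # area (3): interaction flow
    annotations: list[SocialAnnotation] = field(default_factory=list)

# Usage example with a made-up record:
example = InteractionVideo(
    video_id="shrec_0001",
    transcript=["User: Hi there!", "Robot: Processing... please wait."],
    annotations=[
        SocialAnnotation(
            start_sec=2.0,
            end_sec=5.5,
            kind="error",
            attribute="conversational mechanics",
            rationale="The robot ignores the greeting instead of reciprocating.",
            correction="Return the greeting before continuing the task.",
        )
    ],
)
print(f"{example.video_id}: {len(example.annotations)} annotation(s)")
```

A flat per-segment record like this would let each of the eight benchmark tasks be posed as a query over one field (or a pair of fields) given the video and transcript context.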
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 13226