Social-R1: Incentivizing Social Relation Reasoning Capability of Multimodal Large Language Models via Reinforcement Learning

ACL ARR 2025 May Submission3002 Authors

19 May 2025 (modified: 03 Jul 2025) · ACL ARR 2025 May Submission · CC BY 4.0
Abstract: Social relationship recognition, which infers the type of relationship between individuals, is crucial for deeply understanding semantically rich multimodal scenes and supports a wide range of downstream applications. However, despite gains in classification accuracy from end-to-end learning frameworks and knowledge-enhanced models, current approaches still struggle with generalization, interpretability, and efficiency. In this paper, we introduce Social-R1, a multimodal large language model trained with reinforcement learning (RL) for social relationship recognition. Our approach reasons end-to-end directly from images and bounding boxes, without requiring multi-stage pipelines or handcrafted prompts. Social-R1 achieves state-of-the-art performance on the PIPA and PISC benchmarks while generating human-understandable rationales that significantly improve interpretability in social relationship recognition.
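The abstract does not state which reward drives the RL training. As a minimal sketch, assuming an R1-style setup with rule-based, verifiable rewards (a format reward for a hypothetical `<think>...</think><answer>...</answer>` output template plus an exact-match accuracy reward on the predicted relationship label) and GRPO-style group-normalized advantages, the reward side of such training could look like this. The template, function names, and equal reward weights are illustrative assumptions, not Social-R1's published method.

```python
import re
from statistics import mean, pstdev

# Hypothetical output template (assumption): rationale in <think>, label in <answer>.
THINK_ANSWER = re.compile(r"^<think>.+?</think>\s*<answer>(.+?)</answer>$", re.DOTALL)


def format_reward(completion: str) -> float:
    """Return 1.0 if the completion follows the assumed template, else 0.0."""
    return 1.0 if THINK_ANSWER.match(completion.strip()) else 0.0


def accuracy_reward(completion: str, gold_label: str) -> float:
    """Return 1.0 if the extracted label matches the gold label (case-insensitive)."""
    match = THINK_ANSWER.match(completion.strip())
    if match is None:
        return 0.0
    predicted = match.group(1).strip().lower()
    return 1.0 if predicted == gold_label.strip().lower() else 0.0


def total_reward(completion: str, gold_label: str) -> float:
    """Unweighted sum of format and accuracy rewards (weights are an assumption)."""
    return format_reward(completion) + accuracy_reward(completion, gold_label)


def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within a group of sampled completions for one input,
    as in GRPO: (r - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

For one image and person pair, several completions would be sampled, scored with `total_reward`, and converted to advantages with `group_advantages`; completions that both follow the template and name the correct relationship (e.g. "professional") receive positive advantages, while malformed or wrong answers receive negative ones. Rule-based rewards like these need no learned reward model, which is one reason they are popular for tasks with discrete gold labels such as PIPA and PISC.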
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: vision question answering, multimodality
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 3002