Abstract: Audiences increasingly watch diverse dramas and movies on streaming platforms, prompting platform administrators to deepen their understanding of video semantics. In particular, capturing and interpreting social relations among characters is critical for content-driven intelligent services and for enhancing user experience. However, most existing research treats social relation recognition solely as a classification problem and offers no justifiable interpretation of prediction results within the video context. To address this issue, we draw on cognitive science research and the Chain-of-Thought (CoT) strategy. Based on these foundations, we propose SaMo-CoT, an approach that leverages large language models (LLMs) for step-by-step reasoning. The approach simulates human cognitive processes by combining social scenarios in videos with empirical social interaction knowledge. In this way, we produce verbal rationales for determining social relations in video understanding. Furthermore, we present an innovative doubly-right social relation recognition framework that predicts both correct social relation labels and correct scenario rationales. Specifically, we translate verbal CoT into multimodal CoT by leveraging scenario-aware prompts and contrastive learning. Extensive experiments demonstrate significant improvements in classification accuracy and interpretability compared to traditional approaches.
External IDs: doi:10.1109/tcsvt.2025.3571422