A Multi-Granularity Relation Graph Aggregation Framework With Multimodal Clues for Social Relation Reasoning
Abstract: Social relations are a fundamental attribute of human beings in daily life. The human ability to form large organizations and institutions stems directly from our complex social networks. Understanding social relationships in multimedia is therefore crucial for building domain-specific or general artificial intelligence systems. The key to reasoning about social relations lies in understanding interactions between individuals through multimodal representations such as actions and utterances. However, owing to video editing techniques and varied narrative sequences, two individuals who share a social relationship may never appear together in the same frame or clip. Moreover, social relations may manifest at different levels of granularity in video. Previous research has not effectively addressed these challenges. This paper therefore proposes a Multi-Granularity Relation Graph Aggregation Framework (MGRG) to enhance inference for social relation reasoning in multimedia content such as video. Unlike existing methods, ours adopts the paradigm of jointly inferring relations by constructing a social relation graph. We design a hierarchical multimodal relation graph that models the exchange of information between individuals' roles, capturing complex interactions at multiple levels of granularity from fine to coarse. Within MGRG, we propose two aggregation modules that cluster multimodal features in the relation graph at each granularity layer, taking temporal order and feature importance into account. Experimental results show that our method generates a logical, coherent social relation graph and improves accuracy.
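The abstract does not give implementation details, so the snippet below is a minimal illustrative sketch only, not the authors' actual MGRG modules. It shows one generic way the two ingredients named in the abstract could look: an importance-weighted (attention-style) aggregation over a relation graph, and a temporal aggregation over per-clip features. The class names `ImportanceAggregation` and `TemporalAggregation`, the GRU choice, and all shapes are assumptions for illustration.

```python
import torch
import torch.nn as nn

class ImportanceAggregation(nn.Module):
    """Aggregate neighbor features into each node with learned importance
    weights (a generic attention-style pooling; hypothetical, not the
    paper's exact module)."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(2 * dim, 1)  # scores a (node, neighbor) pair

    def forward(self, x, adj):
        # x: (N, D) node features; adj: (N, N) binary adjacency
        N, D = x.shape
        pairs = torch.cat(
            [x.unsqueeze(1).expand(N, N, D), x.unsqueeze(0).expand(N, N, D)],
            dim=-1,
        )                                                 # (N, N, 2D) node pairs
        logits = self.score(pairs).squeeze(-1)            # (N, N)
        logits = logits.masked_fill(adj == 0, float("-inf"))
        weights = torch.softmax(logits, dim=-1)           # importance per neighbor
        weights = torch.nan_to_num(weights)               # isolated nodes -> 0
        return weights @ x                                # (N, D) aggregated

class TemporalAggregation(nn.Module):
    """Fuse per-clip node features over time with a GRU, standing in for
    the temporal aggregation the abstract mentions (assumed design)."""
    def __init__(self, dim):
        super().__init__()
        self.gru = nn.GRU(dim, dim, batch_first=True)

    def forward(self, x_seq):
        # x_seq: (N, T, D) per-node features across T clips
        _, h = self.gru(x_seq)
        return h.squeeze(0)                               # (N, D) temporal summary

if __name__ == "__main__":
    N, T, D = 6, 4, 32                      # people, clips, feature dim
    feats = torch.randn(N, T, D)            # hypothetical multimodal features
    adj = (torch.rand(N, N) > 0.5).float()  # hypothetical relation graph
    fine = TemporalAggregation(D)(feats)                  # fine-grained summary
    coarse = ImportanceAggregation(D)(fine, adj)          # coarser relational layer
    print(coarse.shape)                     # torch.Size([6, 32])
```

Stacking such layers, with each coarser layer aggregating the output of the finer one, is one plausible reading of the fine-to-coarse hierarchy the abstract describes.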
External IDs: dblp:journals/tmm/XuCJWJLZZ25