SoMORE: Social Context-Aware MLLM for Video Character Search

Published: 2025 · Last Modified: 16 Jan 2026 · KSEM (2) 2025 · CC BY-SA 4.0
Abstract: Video character search is a fundamental yet challenging task in video understanding, as pose variations, occlusions, and style differences reduce feature discriminability. Existing techniques are often constrained by traditional training and testing frameworks as well as homogeneous training data. These methods fail to capture the identity information embedded in the high-level semantics of complex or cross-domain scenarios (i.e., where a statistical distribution discrepancy exists between queries and the training corpus), leading to significant performance degradation. The emergence of multimodal large language models (MLLMs) offers a potential solution to this problem. Although MLLMs are not particularly adept at person retrieval, owing to differences in pretraining tasks, their powerful general semantic understanding and reasoning capabilities can aid in exploring identity information in complex social scenarios. Building on this, we propose a novel framework, Social MultimOdal RE-identification (SoMORE). Specifically, we guide the MLLM to extract visual features of individuals that are highly relevant to identity recognition. By leveraging the deep semantic understanding and reasoning capabilities of the MLLM, we then explore the relationships between individuals to provide richer prior knowledge. Ultimately, these relationships enable us to connect the query individual with the gallery individuals, aggregating identity cues from related gallery subjects to enhance the robustness of person features. Experimental results demonstrate that, compared to traditional models, SoMORE not only excels in complex scenarios but also achieves state-of-the-art (SOTA) performance on the defined cross-domain problem.
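The aggregation step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names (`aggregate`, `cosine`), the toy feature vectors standing in for MLLM-extracted identity descriptors, and the `relations` scores standing in for MLLM-inferred social-relation strengths are all assumptions made for exposition.

```python
import math

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def aggregate(query_feat, gallery_feats, relations, alpha=0.5):
    """Refine a query feature by blending it with a relation-weighted
    sum of gallery features (hypothetical form of SoMORE's cue aggregation)."""
    total = sum(relations)
    if total == 0:
        return list(query_feat)
    weights = [r / total for r in relations]
    context = [sum(w * g[i] for w, g in zip(weights, gallery_feats))
               for i in range(len(query_feat))]
    return [(1 - alpha) * q + alpha * c for q, c in zip(query_feat, context)]

# Toy example: the second gallery subject is strongly related to the query,
# so aggregation pulls the refined query toward that subject's feature.
query = [1.0, 0.0, 0.0]
gallery = [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]
relations = [0.1, 0.8, 0.1]  # assumed MLLM-derived relation strengths

refined = aggregate(query, gallery, relations)
scores = [cosine(refined, g) for g in gallery]
best = scores.index(max(scores))  # index of the best-matching gallery subject
```

The key design idea this sketch captures is that identity evidence need not come from the query crop alone: socially related gallery subjects contribute weighted cues that make the final feature more robust to occlusion or style shift.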