Compositional Preference Learning for Composed Person Retrieval

Published: 11 May 2026, Last Modified: 11 May 2026
Venue: AERO-HPR 2026 Poster
License: CC BY 4.0
Track: Non-Proceedings Track
Keywords: Aerial Surveillance, Composed Person Retrieval, Multimodal Compositional Reasoning
TL;DR: We improve composed person retrieval by modeling compositional preference, training the model to rank correct image–text compositions above cross-sample counterfactual variants.
Abstract: In aerial surveillance, person retrieval is challenging due to low resolution, large viewpoint changes, and appearance ambiguity, which limit appearance-only matching. These conditions motivate multimodal person retrieval settings in which identity cues from a reference image are combined with language-based descriptions of appearance change. We study this problem through composed person retrieval (CPR), where a target image is retrieved given a reference image and a modification text. We propose a simple framework that models compositional preference by constructing mismatched compositions via cross-sample replacement of either images or texts, and training the model to rank the original composition above these variants. This objective encourages the model to remain sensitive to changes in either the modification text or the reference identity, enabling finer-grained compositional reasoning. Our method achieves state-of-the-art performance on the ITCPR benchmark, surpassing the previous best supervised CPR baseline by 2.18% in Rank-1 and 1.80% in mAP. These results demonstrate that explicitly modeling compositional preference is an effective strategy for composed person retrieval and a promising direction for challenging surveillance scenarios, particularly aerial surveillance.
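
The ranking objective described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration, not the authors' implementation: the fusion rule in `compose` (a normalized sum of image and text embeddings), the cosine scoring, the hinge form of the loss, the `margin` value, and all function names are hypothetical. The abstract only states that mismatched compositions are built by cross-sample replacement of either the image or the text, and that the original composition should rank above them.

```python
import math

def normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n > 0 else list(v)

def compose(img, txt):
    # Hypothetical fusion: normalized sum of the reference-image and
    # modification-text embeddings. The real composition module is not
    # specified in the abstract.
    return normalize([a + b for a, b in zip(img, txt)])

def score(img, txt, target):
    """Cosine similarity between the composed query and the target embedding."""
    q = compose(img, txt)
    t = normalize(target)
    return sum(a * b for a, b in zip(q, t))

def compositional_preference_loss(imgs, txts, targets, margin=0.2):
    """Hinge ranking loss (assumed form): the original (image, text) pair should
    score higher against its target than any cross-sample variant."""
    n = len(imgs)
    total, count = 0.0, 0
    for i in range(n):
        pos = score(imgs[i], txts[i], targets[i])
        for j in range(n):
            if j == i:
                continue
            # Counterfactual 1: replace the reference image with another sample's.
            neg_img = score(imgs[j], txts[i], targets[i])
            # Counterfactual 2: replace the modification text with another sample's.
            neg_txt = score(imgs[i], txts[j], targets[i])
            total += max(0.0, margin - pos + neg_img)
            total += max(0.0, margin - pos + neg_txt)
            count += 2
    return total / count if count else 0.0

# Toy batch of two samples with 4-dimensional embeddings (made-up values).
imgs = [[1, 0, 0, 0], [0, 1, 0, 0]]
txts = [[0, 0, 1, 0], [0, 0, 0, 1]]
targets_aligned = [[1, 0, 1, 0], [0, 1, 0, 1]]   # each target matches its own pair
targets_swapped = [[0, 1, 0, 1], [1, 0, 1, 0]]   # targets assigned to the wrong pair

loss_aligned = compositional_preference_loss(imgs, txts, targets_aligned)
loss_swapped = compositional_preference_loss(imgs, txts, targets_swapped)
print(loss_aligned, loss_swapped)
```

In this toy setup, the loss is zero when each target agrees with both its reference image and its modification text, and positive when either component is mismatched. That is the sensitivity to both inputs that the abstract attributes to the objective.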
Submission Number: 13