Re-identifying People in Video via Learned Temporal Attention and Multi-modal Foundation Models

Published: 26 Feb 2025 · Last Modified: 12 Nov 2025 · WACV · CC BY-NC-SA 4.0
Abstract: Biometric recognition from security camera video is a challenging problem when individuals change clothes or are partly occluded. Recent work has demonstrated that CLIP's visual encoder performs well in this domain, but existing methods fail to make use of the model's text encoder or the temporal information available in video. In this paper, we present VCLIP, a method for person identification in videos captured in challenging poses and with changes to a person's clothing. Harnessing the power of pre-trained vision-language models, we jointly train a temporal fusion network while fine-tuning the visual encoder. To leverage the cross-modal embedding space, we use learned biometric pedestrian attribute features to further enhance our model's person re-identification (Re-ID) ability. We demonstrate significant performance improvements via experiments on the MEVID and CCVID datasets, particularly in the more challenging clothes-changing conditions. In support of this and future methods that use textual attributes for Re-ID with multimodal models, we release a dataset of annotated pedestrian attributes for the popular MEVID dataset [4].
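To make the architecture concrete, below is a minimal sketch (not the authors' released code) of the two ingredients the abstract names: a learned temporal fusion module over per-frame CLIP visual embeddings, and scoring against attribute-text embeddings in the shared cross-modal space. The embedding dimension, layer counts, and the transformer-style fusion design are illustrative assumptions, not details confirmed by the paper.

```python
# Sketch of VCLIP-style temporal fusion over per-frame CLIP embeddings.
# Assumptions: 512-d CLIP features, a small transformer encoder as the
# temporal attention mechanism, and cosine similarity for matching.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalFusion(nn.Module):
    """Fuses a sequence of per-frame CLIP visual embeddings into a single
    track-level embedding via learned temporal attention."""
    def __init__(self, dim: int = 512, num_heads: int = 8, num_layers: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))  # learned summary token

    def forward(self, frame_emb: torch.Tensor) -> torch.Tensor:
        # frame_emb: (batch, num_frames, dim) per-frame CLIP visual features
        b = frame_emb.size(0)
        tokens = torch.cat([self.cls.expand(b, -1, -1), frame_emb], dim=1)
        fused = self.encoder(tokens)[:, 0]       # read out the summary token
        return F.normalize(fused, dim=-1)        # unit-norm track embedding

# Usage: compare fused video-track embeddings against attribute-text
# embeddings produced by CLIP's text encoder (same cross-modal space).
fusion = TemporalFusion()
frames = torch.randn(4, 16, 512)                     # 4 tracks x 16 frames
text_emb = F.normalize(torch.randn(4, 512), dim=-1)  # attribute descriptions
track_emb = fusion(frames)
similarity = track_emb @ text_emb.t()                # cosine similarity matrix
```

In this sketch the attribute pathway is a stand-in: the paper's learned biometric pedestrian attribute features would replace the random `text_emb` tensor, and the visual encoder producing `frames` would be fine-tuned jointly with the fusion module.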