Abstract: Most existing methods for text-based person retrieval focus on text-to-image person retrieval. However, because isolated frames lack dynamic information, performance suffers when the person is occluded or when variable motion details cannot be captured by a single frame. To overcome this, we propose a novel Text-to-Video Person Retrieval (TVPR) task. Since no existing dataset or benchmark describes person videos with natural language, we construct a large-scale cross-modal person video dataset containing detailed natural-language annotations, covering a person's appearance, actions, interactions with the environment, etc., termed the Text-to-Video Person Re-identification (TVPReid) dataset. In this paper, we introduce a Multielement Feature Guided Fragments Learning (MFGF) strategy, which leverages cross-modal text-video representations to provide strong text-visual and text-motion matching information, addressing conflicts caused by uncertain occlusion as well as variable motion details. Specifically, we establish two latent cross-modal spaces for collaborative learning of text and video features, progressively reducing the semantic gap between text and video. To evaluate the effectiveness of the proposed MFGF, we conduct extensive experiments on the TVPReid dataset. To the best of our knowledge, MFGF is the first successful attempt to use video for the text-based person retrieval task, and it achieves state-of-the-art performance on the TVPReid dataset. The TVPReid dataset will be made publicly available to benefit future research.
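The abstract (and the relevance statement below) describes aligning text and video representations in shared latent spaces via contrastive calibration, but does not spell out the objective. As a point of reference, here is a minimal sketch of a standard symmetric text-video contrastive loss of the kind typically used for such cross-modal alignment; it assumes precomputed, batch-aligned text and video embeddings, and all names and dimensions are illustrative rather than the paper's actual implementation.

```python
# Minimal sketch of a symmetric text-video contrastive objective
# (InfoNCE-style). Assumes matched text/video embeddings at the same
# batch index; names and shapes are illustrative, not the paper's API.
import torch
import torch.nn.functional as F

def text_video_contrastive_loss(text_emb, video_emb, temperature=0.07):
    """Pulls matched text-video pairs together in the shared latent
    space and pushes apart mismatched in-batch pairs."""
    text_emb = F.normalize(text_emb, dim=-1)    # (B, D)
    video_emb = F.normalize(video_emb, dim=-1)  # (B, D)
    logits = text_emb @ video_emb.t() / temperature      # (B, B) similarities
    targets = torch.arange(text_emb.size(0), device=text_emb.device)
    loss_t2v = F.cross_entropy(logits, targets)          # text -> video
    loss_v2t = F.cross_entropy(logits.t(), targets)      # video -> text
    return (loss_t2v + loss_v2t) / 2

# Usage sketch: given hypothetical encoders producing (B, D) embeddings,
# loss = text_video_contrastive_loss(text_encoder(captions), video_encoder(clips))
```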
Primary Subject Area: [Experience] Multimedia Applications
Secondary Subject Area: [Engagement] Multimedia Search and Recommendation
Relevance To Conference: Text-image and text-video cross-modal matching has long been a hot spot in multimodal research. Existing text-based person retrieval work focuses on text-to-image person retrieval, which suffers from drawbacks that directly limit retrieval accuracy. To address these drawbacks, we propose a new task, Text-to-Video Person Retrieval (TVPR), together with an effective model. At the same time, since no ready-made dataset is available, we build a large-scale cross-modal person video dataset containing detailed natural-language descriptions, termed the Text-to-Video Person Re-identification (TVPReid) dataset, which lays the foundation for text-to-video person retrieval. In the proposed model, we gradually reduce the semantic differences between text and video through their latent interactions, and continuously deepen the model's understanding of visual and motion information through contrastive calibration in the latent space, thereby optimizing the text and video representations. Overall, this paper strives to make the leap from text-image to text-video retrieval, opening up a broader research space for text-based person retrieval and promoting the development of multimodal research.
Supplementary Material: zip
Submission Number: 1996