Abstract: Retrieving a target vehicle from natural language descriptions plays a crucial role in intelligent transportation systems. Existing methods tackle this task with models that leverage the correlation between textual and visual representations, such as CLIP. However, these models struggle to capture the temporal characteristics of video data, so researchers have tried to improve temporal understanding through various data augmentation techniques and video encoders. Yet these conventional approaches often overlook the detailed temporal characteristics of vehicles. To overcome this limitation, we introduce MOVES, a Motion-Oriented VidEo Sampling method that effectively utilizes the motion information of the target vehicle. Furthermore, we construct a robust model by implementing a re-ranking algorithm to address a variety of vehicle attributes. As a result, our proposed model achieves state-of-the-art performance on a public vehicle retrieval dataset.