Distill the Knowledge of Multimodal Large Language Model into Text-to-Image Vehicle Re-identification

Jianshu Zeng, Chi Zhang

Published: 01 Jan 2025 · Last Modified: 12 Nov 2025 · Crossref · CC BY-SA 4.0
Abstract: Text-to-Image Vehicle Re-identification (TIVReid) aims to retrieve images of a target vehicle given a natural-language description. For this task, effective alignment of image and text features is crucial, yet it is constrained by the lack of large-scale, high-quality datasets. Recently, multimodal large language models (MLLMs) have shown remarkable performance in image-text understanding, which motivates this paper to explore their application to TIVReid. We propose an effective method to distill the knowledge of MLLMs into a TIVReid model, with the following innovations. First, we propose a prompt design approach that introduces an attribute-guided pre-prompt and an optimized few-shot policy to guide the MLLM toward generating high-quality descriptions. Second, we devise a two-stage alignment strategy to better exploit the generated data: we relax the alignment on the non-target domain (the generated data) in stage 1 and then strengthen it on the target domain in stage 2. Finally, extensive experiments demonstrate the effectiveness of our method and show that the generated data are comparable, or even superior, to human-annotated data. Compared to the SOTA model on the T2I-VeRi dataset, our method achieves significant improvements of 6.7%, 7.6%, and 4.9% in Rank-1, Rank-5, and mAP, respectively. Code and dataset will be open-sourced at https://github.com/Fly-ShuAI/TIVR2.
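To make the prompt design concrete, below is a minimal sketch of how an attribute-guided pre-prompt combined with a few-shot policy could be assembled into a chat-style MLLM request. The attribute list, the exemplar captions, and the `build_mllm_messages` helper are illustrative assumptions for exposition; they are not the paper's actual prompts or interface.

```python
# Attribute-guided pre-prompt: enumerate the vehicle attributes the
# MLLM should cover, then prepend a few curated image-caption
# exemplars (the few-shot policy). All strings below are
# illustrative placeholders, not the paper's actual prompts.
ATTRIBUTES = ["type", "color", "brand", "orientation", "decorations"]

PRE_PROMPT = (
    "Describe the vehicle in the image for retrieval. "
    "Cover these attributes: " + ", ".join(ATTRIBUTES) + ". "
    "Do not mention the background or image quality."
)

FEW_SHOT_EXAMPLES = [
    ("<example_image_1>",
     "A white Toyota sedan facing left, with a sunroof and a roof rack."),
    ("<example_image_2>",
     "A red SUV viewed from the rear, carrying a spare tire on the back."),
]

def build_mllm_messages(query_image):
    """Assemble a chat-style request: pre-prompt, few-shot pairs, query."""
    messages = [{"role": "system", "content": PRE_PROMPT}]
    for image, caption in FEW_SHOT_EXAMPLES:
        messages.append({"role": "user", "content": image})
        messages.append({"role": "assistant", "content": caption})
    messages.append({"role": "user", "content": query_image})
    return messages
```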
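The two-stage alignment strategy can likewise be sketched as a staged contrastive objective: a softer alignment loss on the MLLM-generated (non-target) pairs in stage 1, then a tighter one on the target domain in stage 2. The symmetric InfoNCE loss, the temperatures, the loss weights, and the `encoder.encode_image`/`encoder.encode_text` interface are all our assumptions for illustration; the paper's exact relaxation mechanism may differ.

```python
import torch
import torch.nn.functional as F

def infonce_alignment_loss(img_emb, txt_emb, temperature):
    """Symmetric InfoNCE loss over L2-normalized image/text embeddings."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Stage 1: relaxed alignment on generated (non-target) pairs, e.g. a
# higher temperature and smaller weight so noisy MLLM captions do not
# over-constrain the embedding space.
# Stage 2: tighter alignment on the human-annotated target domain.
# All values below are hypothetical.
STAGE_CONFIG = {
    "stage1_generated": {"temperature": 0.15, "loss_weight": 0.5},
    "stage2_target":    {"temperature": 0.05, "loss_weight": 1.0},
}

def training_step(batch, encoder, stage):
    """One alignment step; `encoder` exposes assumed encode_* methods."""
    cfg = STAGE_CONFIG[stage]
    img_emb = encoder.encode_image(batch["images"])
    txt_emb = encoder.encode_text(batch["captions"])
    loss = infonce_alignment_loss(img_emb, txt_emb, cfg["temperature"])
    return cfg["loss_weight"] * loss
```

Under this reading, stage 1 trades alignment sharpness for robustness to caption noise, and stage 2 restores a strict objective once the model is warmed up on the generated data.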