Abstract: Existing video retrieval datasets largely ignore cross-matching between captions and videos: a caption often matches multiple videos yet is labeled as exclusive to a single one, producing numerous incorrectly mismatched caption-video pairs. This oversight can hinder model performance and distort the evaluation of video retrieval. To alleviate this problem, we develop a training-free annotation pipeline, Bootstrapping Video Retrieval via Cross-matching Correction (BVRCC), which leverages ChatGPT and Multimodal Large Language Models (MLLMs) to correct these mismatched data. We conduct experiments on the video retrieval datasets MSRVTT and MSVD. First, we correct the MSRVTT test set and utilize the entire MSVD benchmark to facilitate cross-matching-based training and evaluation. We then re-train and re-evaluate video retrieval models on the corrected datasets, revealing their true performance. Moreover, to fully exploit the corrected training data, we integrate a Cross-matching-based Learning Rate Strategy (CLRS) into video retrieval models, achieving a 2.85 R@10 improvement on MSVD.