BVRCC: Bootstrapping Video Retrieval via Cross-Matching Correction

Published: 01 Jan 2024 · Last Modified: 05 Mar 2025 · ICANN (6) 2024 · CC BY-SA 4.0
Abstract: Existing video retrieval datasets largely ignore cross-matching between captions and videos: a caption often matches multiple videos but is labeled as exclusive to a single one, producing numerous incorrectly mismatched pairs. This oversight can hinder model performance and distort the evaluation of video retrieval. To alleviate this problem, we develop a training-free annotation pipeline, Bootstrapping Video Retrieval via Cross-matching Correction (BVRCC), which leverages ChatGPT and Multimodal Large Language Models (MLLMs) to correct these incorrectly mismatched data. We conduct experiments on the video retrieval datasets MSRVTT and MSVD: we correct the MSRVTT test set and annotate the entire MSVD benchmark to enable cross-matching-based training and evaluation. We then re-train and re-evaluate video retrieval models on the corrected datasets, revealing their true performance. Moreover, to fully exploit the corrected training data, we integrate a Cross-matching-based Learning Rate Strategy (CLRS) into video retrieval models, achieving a 2.85 R@10 improvement on MSVD.
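The abstract does not give the pipeline's internals, so the following is only a minimal sketch of one plausible two-stage correction loop. The names `paraphrase_filter` and `mllm_verifies` are hypothetical stand-ins for the ChatGPT and MLLM calls, stubbed with trivial logic so the sketch runs offline; the real pipeline's prompts and thresholds are not specified here.

```python
def paraphrase_filter(caption_a: str, caption_b: str) -> bool:
    """Stand-in for a ChatGPT call asking whether two captions could
    describe the same scene; stubbed with token overlap so the sketch
    runs offline."""
    a, b = set(caption_a.lower().split()), set(caption_b.lower().split())
    return len(a & b) / max(len(a | b), 1) > 0.5

def mllm_verifies(video_id: str, caption: str) -> bool:
    """Stand-in for an MLLM call checking the caption against the
    video's frames; always confirms in this offline sketch."""
    return True

def correct_cross_matches(annotations: dict[str, list[str]]) -> dict[str, set[str]]:
    """annotations: video_id -> its originally labeled captions.
    Returns video_id -> corrected caption set, adding captions from
    other videos that pass both screening stages."""
    corrected = {vid: set(caps) for vid, caps in annotations.items()}
    for vid, caps in annotations.items():
        for other_vid, other_caps in annotations.items():
            if other_vid == vid:
                continue
            for cap in other_caps:
                # Stage 1: cheap text-only screening against the
                # video's existing captions.
                if not any(paraphrase_filter(cap, own) for own in caps):
                    continue
                # Stage 2: confirm the surviving candidate against
                # the actual video content.
                if mllm_verifies(vid, cap):
                    corrected[vid].add(cap)
    return corrected

if __name__ == "__main__":
    anns = {"v1": ["a dog runs on grass"], "v2": ["a dog runs on the grass"]}
    print(correct_cross_matches(anns))  # each video gains the other's caption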
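CLRS is likewise not specified in the abstract. One plausible reading, sketched below purely as an assumption, is to damp the effective learning rate for captions that the corrected annotations match to many videos, since such captions are ambiguous one-to-one targets. Per-sample loss weighting is used here because, for plain gradient descent, scaling a sample's loss is equivalent to scaling its learning rate.

```python
import torch

def clrs_scale(num_matches: torch.Tensor) -> torch.Tensor:
    """Hypothetical scale: damp updates for captions that the corrected
    annotations match to many videos; uniquely matched pairs keep a
    full-strength update."""
    return 1.0 / num_matches.clamp(min=1).float()

# Stand-in per-sample retrieval losses and corrected cross-match counts.
per_sample_loss = torch.rand(4, requires_grad=True)
num_matches = torch.tensor([1, 3, 1, 6])

# grad of (w * loss) is w * grad(loss), so this weighting acts as a
# per-sample learning-rate scale during the backward pass.
weighted = (clrs_scale(num_matches) * per_sample_loss).mean()
weighted.backward()
```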