A Video is Worth 10,000 Words: Training and Benchmarking with Diverse Captions for Better Long Video Retrieval
Abstract: Existing long video retrieval systems are trained and
tested in the paragraph-to-video retrieval regime, where every long video is described by a single long paragraph.
This neglects the richness and variety of possible valid descriptions of a video, which could range anywhere from
moment-by-moment detail to a single-phrase summary. To
provide a more thorough evaluation of the capabilities of
long video retrieval systems, we propose a pipeline that
leverages state-of-the-art large language models to carefully generate a diverse set of synthetic captions for long
videos. We validate this pipeline’s fidelity via rigorous human inspection. We use synthetic captions from this pipeline
to benchmark a representative set of video-language models on long video datasets, and show that these
models struggle on shorter captions. We show that finetuning on this data can both mitigate these issues (+2.8% R@1
over SOTA on ActivityNet with diverse captions) and even
improve performance on standard paragraph-to-video retrieval (+1.0% R@1 on ActivityNet). We also use synthetic
data from our pipeline as query expansion in the zero-shot
setting (+3.4% R@1 on ActivityNet). We derive insights by
analyzing failure cases for retrieval with short captions.