TeachText: CrossModal text-video retrieval through generalized distillation

Published: 01 Jan 2025, Last Modified: 21 Feb 2025, Artif. Intell. 2025, CC BY-SA 4.0
Abstract: Highlights
• TeachText leverages the additional information brought by the usage of multiple text embeddings.
• We propose learning the retrieval similarity matrix between joint query-video embeddings.
• We achieve significant gains across six text-video retrieval benchmarks.
• We improve the CE+ architecture with GPT-J embeddings, boosting performance.
• A thorough error analysis highlights the benefits of multiple text embeddings in text-video retrieval.
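
The idea of learning a retrieval similarity matrix from several text embeddings can be illustrated with a minimal sketch. It is not the authors' implementation: the function names are hypothetical, and both the mean aggregation of teacher matrices and the squared-error penalty are assumptions chosen for simplicity, since the highlights above do not specify either. Each teacher is assumed to embed queries with a different text encoder into a space shared with the video embeddings.

```python
import math


def cosine_similarity_matrix(queries, videos):
    """Return S where S[i][j] is the cosine similarity of query i and video j."""
    def norm(vec):
        return math.sqrt(sum(x * x for x in vec))

    return [
        [
            sum(q * v for q, v in zip(query, video)) / (norm(query) * norm(video))
            for video in videos
        ]
        for query in queries
    ]


def aggregate_teachers(teacher_matrices):
    """Element-wise mean of the teacher similarity matrices (assumed aggregation)."""
    count = len(teacher_matrices)
    rows = len(teacher_matrices[0])
    cols = len(teacher_matrices[0][0])
    return [
        [sum(m[i][j] for m in teacher_matrices) / count for j in range(cols)]
        for i in range(rows)
    ]


def distillation_loss(student, target):
    """Mean squared difference between two similarity matrices (assumed penalty)."""
    total = 0.0
    cells = 0
    for student_row, target_row in zip(student, target):
        for s, t in zip(student_row, target_row):
            total += (s - t) ** 2
            cells += 1
    return total / cells


# Toy data: one query embedded by two hypothetical text encoders, two videos.
videos = [[1.0, 0.0], [0.0, 1.0]]
teacher_a = cosine_similarity_matrix([[1.0, 0.0]], videos)  # encoder A's query
teacher_b = cosine_similarity_matrix([[0.0, 1.0]], videos)  # encoder B's query
target = aggregate_teachers([teacher_a, teacher_b])

# The student's own similarity matrix is pulled toward the aggregated target.
student = cosine_similarity_matrix([[1.0, 0.0]], videos)
loss = distillation_loss(student, target)
```

In a training loop, a term like `loss` would typically be added to the usual retrieval objective, so the student benefits from the extra text embeddings during training while needing only a single text encoder at query time.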