TeachText: CrossModal text-video retrieval through generalized distillation

Ioana Croitoru, Simion-Vlad Bogolin, Marius Leordeanu, Hailin Jin, Andrew Zisserman, Yang Liu, Samuel Albanie

Published: 2025, Last Modified: 16 Jan 2026, Venue: Artif. Intell. 2025, License: CC BY-SA 4.0

Abstract: Highlights
• TeachText leverages the additional information provided by multiple text embeddings.
• We propose learning the retrieval similarity matrix between joint query-video embeddings.
• We achieve significant gains across six text-video retrieval benchmarks.
• We improve the CE+ architecture with GPT-J embeddings, boosting performance.
• A thorough error analysis highlights the benefits of multiple text embeddings in text-video retrieval.