Abstract: Highlights
• TeachText leverages the additional information provided by multiple text embeddings.
• We propose learning the retrieval similarity matrix between joint query-video embeddings.
• We achieve significant gains across six text-video retrieval benchmarks.
• We improve the CE+ architecture with GPT-J embeddings, boosting performance.
• A thorough error analysis highlights the benefits of multiple text embeddings in text-video retrieval.
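To make the idea of learning a retrieval similarity matrix more concrete, the following is a minimal sketch and not the authors' implementation. It assumes that each of several teacher models, built on a different text embedding, produces a query-video similarity matrix, and that a student is trained to match an aggregate of those matrices. The highlights do not state the aggregation or the loss, so the mean aggregation, the squared-error loss, the random stand-in embeddings, and the function names `cosine_similarity_matrix` and `distillation_loss` are all assumptions made for illustration.

```python
import numpy as np


def cosine_similarity_matrix(queries, videos):
    """Return an (n_queries, n_videos) matrix of cosine similarities."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    v = videos / np.linalg.norm(videos, axis=1, keepdims=True)
    return q @ v.T


def distillation_loss(student_sim, teacher_sims):
    """Squared error between the student matrix and the mean teacher matrix.

    Both the mean aggregation and the squared-error loss are assumptions;
    the source does not specify either choice.
    """
    target = np.mean(teacher_sims, axis=0)
    return float(np.mean((student_sim - target) ** 2))


# Random stand-ins for joint-space embeddings (hypothetical data).
rng = np.random.default_rng(0)
n_items, dim = 4, 8
videos = rng.normal(size=(n_items, dim))

# Three hypothetical teachers, each using a different text embedding.
teacher_sims = np.stack([
    cosine_similarity_matrix(rng.normal(size=(n_items, dim)), videos)
    for _ in range(3)
])

# The student uses a single text embedding and is pushed towards the teachers.
student_sim = cosine_similarity_matrix(rng.normal(size=(n_items, dim)), videos)
loss = distillation_loss(student_sim, teacher_sims)
print(f"distillation loss: {loss:.4f}")
```

In a real training setup this term would typically be combined with a standard retrieval loss, and only the student would be needed at test time; both points are assumptions about usage rather than claims taken from the highlights.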