Abstract: Text-based video retrieval is a crucial technology for video and multimodal applications. Although caption-video pairs in traditional Text-Video Retrieval are assumed to be fully relevant, the text still omits information present in the video content. In a specific application scenario of Text-Video Retrieval, where the given caption corresponds to only a segment of the target video, aligning the two modalities becomes particularly challenging. To address this issue, we introduce context information as an auxiliary signal to enrich the text representation and enhance alignment. In this work, we propose an effective Linguistic Hallucination framework, which incorporates context captions during training and replaces them at inference with hallucinated textual representations predicted from the source sentence. A dedicated hallucination loss and a consistency loss are designed to supervise the learning process. In addition, Curriculum Learning is introduced at both the data level and the model level, which stabilizes the training procedure and simultaneously improves retrieval performance. Extensive comparison experiments and ablation studies on benchmark datasets demonstrate the effectiveness of our framework. Moreover, we apply our proposed method to other cross-modal tasks, and the promising experimental results confirm its generalization ability. Our code and datasets are available at https://github.com/silenceFS/Linguistic-Hallucination.