MLLM as video narrator: Mitigating modality imbalance in video moment retrieval

Published: 01 Jan 2025, Last Modified: 15 May 2025
Venue: Pattern Recognit. 2025
License: CC BY-SA 4.0
Abstract: Highlights
• A text-enhanced alignment paradigm for addressing the modality gap in moment retrieval.
• A multi-modal large language model creates structured, aligned narratives for retrieval.
• Extensive experiments on two popular benchmarks show effective vision-text learning.