Abstract: Highlights
• A text-enhanced alignment paradigm for addressing the modality gap in moment retrieval.
• A multi-modal large language model creates structured, aligned narratives for retrieval.
• Extensive experiments on two popular benchmarks show effective vision-text learning.