Abstract: Highlights
• Multi-grained video-text alignment when extracting visual features for captioning.
• A learnable token shift module that enhances fine-grained inter-frame information interaction.
• Refineformer provides additional, strongly text-related spatial information to the caption decoder.
• Favorable performance on the MSVD, MSR-VTT, and VATEX benchmarks.