Rethink video retrieval representation for video captioning

Published: 01 Jan 2024, Last Modified: 05 Mar 2025, Pattern Recognit. 2024, CC BY-SA 4.0
Abstract: Highlights
• Multi-grained video-text alignment when extracting visual features for captioning.
• A learnable token shift module to enhance fine-grained inter-frame information interaction.
• Refineformer provides additional, strongly text-related spatial information for the caption decoder.
• Favorable performance on the MSVD, MSR-VTT and VATEX benchmarks.