Abstract: As a fundamental task in the multimodal domain, text-to-video retrieval has received great attention in recent years. Most current research focuses on the interaction between coarse-grained cross-modal features, leaving the feature granularity of retrieval models underexplored. We therefore introduce intra-video region information into cross-modal retrieval and propose a cross-modal fine-grained feature retrieval framework. Videos are represented as video-frame-region triple features, texts as sentence-word dual features, and the cross-similarity between visual and textual features is computed through token-wise interaction. This effectively extracts detailed information from the video, guides the model to attend to informative video regions and keywords in the sentence, and reduces the adverse effects of redundant words and interfering frames. On MSRVTT, the most popular retrieval benchmark, the framework achieves state-of-the-art results (51.1 R@1). These strong results demonstrate the superiority of fine-grained feature interaction.
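The abstract does not spell out how the token-wise interaction is computed, so the following is a minimal sketch of one common formulation: every text token (sentence/word features) is scored against every video token (video/frame/region features), and per-token maxima are averaged into a single cross-similarity. The function name, the max-then-mean aggregation, and the tensor layout are all assumptions for illustration, not the paper's confirmed method.

```python
import torch
import torch.nn.functional as F

def token_wise_similarity(text_tokens, video_tokens):
    """Cross-similarity between text and video token sets.

    Hypothetical sketch of token-wise interaction: each text token
    is matched to its most similar video token, then scores are
    averaged over text tokens. The aggregation scheme (max over
    video tokens, mean over text tokens) is one common choice, not
    necessarily the paper's exact formulation.

    text_tokens:  (B, Nt, D)  e.g. [sentence; word_1 .. word_k]
    video_tokens: (B, Nv, D)  e.g. [video; frames; regions]
    returns:      (B, B) similarity matrix for contrastive training
    """
    t = F.normalize(text_tokens, dim=-1)
    v = F.normalize(video_tokens, dim=-1)
    # Pairwise token similarities for every text-video pair:
    # shape (B_text, B_video, Nt, Nv).
    sim = torch.einsum("atd,bvd->abtv", t, v)
    # Max over video tokens lets keywords attend to informative
    # regions/frames; mean over text tokens pools the sentence.
    return sim.max(dim=-1).values.mean(dim=-1)


if __name__ == "__main__":
    B, Nt, Nv, D = 4, 12, 20, 512
    text = torch.randn(B, Nt, D)   # sentence + word features
    video = torch.randn(B, Nv, D)  # video + frame + region features
    logits = token_wise_similarity(text, video)
    print(logits.shape)  # torch.Size([4, 4]); diagonal = matched pairs
```

Under this reading, the max over video tokens is what suppresses interfering frames (they are never a word's best match), while the mean over text tokens dilutes the contribution of redundant words.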