RSANet: Relationship-Aware Symmetric Alignment Network for Fine-Grained Video-Text Retrieval

Published: 2024, Last Modified: 21 Jan 2026 · PRICAI (4) 2024 · CC BY-SA 4.0
Abstract: Video-text retrieval aims to retrieve the most similar samples in another modality from a database, given a query in one modality (e.g., text or video). The primary challenge lies in capturing fine-grained relationships, including the features of individual objects and the interactions among diverse objects. Existing approaches predominantly parse text independently, neglecting video parsing and thereby failing to effectively capture the complementary relations across video-text pairs. In this paper, we introduce a novel Relationship-aware Symmetric Alignment Network (RSANet) for fine-grained video-text retrieval. Specifically, we first adaptively recalibrate the frame-wise features of videos and extract relationship-aware fine-grained features from the recalibrated frames. Subsequently, a tailored heterogeneous graph convolutional network is formulated to encode each recalibrated frame. Correspondingly, we parse texts into relationship-aware nodes and employ a pre-trained model to extract the contextual features of these nodes. As a result, relationship-aware cross-modality features can be obtained, enabling alignment in a more plausible, fine-grained manner. In addition, a negative-sample-enhanced ranking loss is proposed to optimize RSANet, promoting model outputs with larger inter-class variation and smaller intra-class variation. Extensive experiments on three public datasets, namely MSR-VTT, VATEX, and PKU FG-XMedia, show that RSANet surpasses state-of-the-art methods.
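The abstract does not give the loss formulation, so the sketch below only illustrates one common way a negative-sample-enhanced ranking loss can be realized: a bidirectional max-margin loss over in-batch negatives with extra weight on the hardest negative per query, which is what encourages larger inter-class and smaller intra-class variation. The function name `negative_enhanced_ranking_loss` and the `margin` and `hard_weight` values are illustrative assumptions, not the paper's actual formulation.

```python
# Minimal sketch (assumed formulation, not the paper's exact loss):
# bidirectional max-margin ranking over in-batch negatives, with the
# hardest negative per query weighted more heavily.
import torch


def negative_enhanced_ranking_loss(video_emb, text_emb, margin=0.2, hard_weight=2.0):
    """video_emb, text_emb: (B, D) L2-normalized embeddings, where row i
    of each tensor belongs to the same matched video-text pair."""
    sim = video_emb @ text_emb.t()          # (B, B) cosine similarities
    pos = sim.diag().unsqueeze(1)           # positive-pair scores, (B, 1)

    # Margin violations for every in-batch negative, in both directions.
    cost_v2t = (margin + sim - pos).clamp(min=0)      # video query vs. text negatives
    cost_t2v = (margin + sim - pos.t()).clamp(min=0)  # text query vs. video negatives

    # Zero out the diagonal so positives do not count as negatives.
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    cost_v2t = cost_v2t.masked_fill(mask, 0)
    cost_t2v = cost_t2v.masked_fill(mask, 0)

    # Emphasize the hardest negative per query on top of the mean violation.
    hard = cost_v2t.max(dim=1).values + cost_t2v.max(dim=0).values
    mean = cost_v2t.mean() + cost_t2v.mean()
    return hard_weight * hard.mean() + mean
```

Under this assumed form, the mean term keeps all violating negatives below the margin while the weighted hardest-negative term dominates the gradient, separating the closest confusable pairs first.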