Fine-grained Relationship Alignment Network for Video-Text Retrieval

Published: 2025, Last Modified: 21 Jan 2026, Venue: ISCAS 2025, License: CC BY-SA 4.0
Abstract: Given a query in one modality, video-text retrieval aims to retrieve the most similar samples from a database in the other modality. The primary challenge lies in aligning fine-grained topology, including objects and the interactions among diverse objects. In this paper, we introduce a fine-grained relationship alignment network. Specifically, we adaptively recalibrate the frame-wise features of videos and extract fine-grained relationship features, including semantic objects and the structural interactions among various objects. Correspondingly, we parse texts along dual paths and encode their semantic and structural features. Finally, we combine the semantic and structural features to align videos and texts. In addition, we propose a negative-sample-enhanced ranking mechanism to optimize the network. Experiments on public datasets demonstrate the advantages of our model.
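The abstract does not specify the form of the negative-sample-enhanced ranking mechanism. As a rough illustration of the general idea of ranking with emphasized negatives, the sketch below implements a generic bidirectional max-margin ranking loss that uses the hardest in-batch negative for each query. The function name `hardest_negative_ranking_loss`, the `margin` value, and the choice of hardest-negative mining are all assumptions made for illustration, not the paper's method.

```python
from typing import List


def hardest_negative_ranking_loss(sim: List[List[float]], margin: float = 0.2) -> float:
    """Bidirectional max-margin ranking loss with the hardest in-batch negative.

    sim[i][j] is the similarity between video i and text j; matched pairs
    lie on the diagonal. Generic illustration only (hypothetical helper),
    not the loss proposed in the paper.
    """
    n = len(sim)
    if n < 2 or any(len(row) != n for row in sim):
        raise ValueError("sim must be a square matrix with at least 2 rows")

    total = 0.0
    for i in range(n):
        pos = sim[i][i]
        # Hardest negative text for video i (video-to-text direction).
        hardest_text = max(sim[i][j] for j in range(n) if j != i)
        # Hardest negative video for text i (text-to-video direction).
        hardest_video = max(sim[j][i] for j in range(n) if j != i)
        # Hinge: penalize negatives that come within `margin` of the positive.
        total += max(0.0, margin + hardest_text - pos)
        total += max(0.0, margin + hardest_video - pos)
    return total / n


# Toy similarity matrices (made-up values, for illustration only).
well_separated = [[0.9, 0.1], [0.2, 0.8]]
confusable = [[0.5, 0.6], [0.1, 0.9]]
loss_separated = hardest_negative_ranking_loss(well_separated)
loss_confusable = hardest_negative_ranking_loss(confusable)
```

In this sketch, a batch whose matched pairs already beat every negative by at least the margin contributes zero loss, while a negative that outscores its positive (video 0 against text 1 in `confusable`) produces a positive penalty.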