Exploring Fine-Grained Relation Alignment for Multimedia Information Retrieval

Published: 2025 · Last Modified: 21 Jan 2026 · PAKDD (1) 2025 · CC BY-SA 4.0
Abstract: Multimedia information retrieval aims to retrieve the most similar samples in another modality from a database, given a query in one modality (e.g. text, video, image). This work focuses on video-text retrieval, as video and text are two of the most prevalent forms of multimedia, and explores fine-grained relation alignment. Specifically, we decompose video-text pair matching into fine-grained levels, covering objects and the structural information among them. In the video module, we explicitly segment the video into a set of fine-grained frames and adaptively recalibrate the frames using an attention mechanism. We then design a multimedia fusion graph convolutional network to update the fine-grained knowledge. In the text module, we learn shared and private features by constructing dual paths for each text. To capture a comprehensive yet unified matching of video-text pairs, the shared features emphasize nouns and verbs that correspond to objects and structural information in the videos, while the private features preserve text-exclusive contextual information. Finally, we develop a fine-grained similarity score that aligns object-to-structure relations beyond coarse-grained sample matching via a margin loss. Extensive experiments on public datasets, namely MSR-VTT, VATEX, and PKU FG-XMedia, show that our model surpasses state-of-the-art methods.
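The abstract does not spell out the exact form of the margin loss; a minimal sketch of a standard hinge-style margin loss over similarity scores, the usual formulation in retrieval (function name, margin value, and list-based signature are illustrative assumptions, not the paper's code), could look like:

```python
def margin_loss(sim_pos, sim_neg, margin=0.2):
    """Hinge-style margin loss over similarity scores.

    Penalizes each case where a mismatched (negative) pair scores
    within `margin` of the matched (positive) pair's similarity.
    """
    losses = [max(0.0, margin - p + n) for p, n in zip(sim_pos, sim_neg)]
    return sum(losses) / len(losses)


# Example: a well-separated pair incurs no loss; a close pair does.
print(margin_loss([0.9], [0.3]))   # positive clears negative by > margin
print(margin_loss([0.5], [0.45]))  # gap is only 0.05, so loss = 0.15
```

In practice such a loss is computed over all positive/negative pairings in a batch; the paper applies it to its fine-grained similarity score rather than a single coarse cosine similarity.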