Abstract: With the proliferation of video on the Internet, users demand higher precision and efficiency from retrieval technology. Current cross-modal retrieval methods suffer from three main problems. First, the same semantic objects in video and text are not effectively aligned. Second, existing neural networks destroy the spatial features of a video while building its temporal features. Third, the extraction and processing of the text's local features are overly complex, which increases network complexity. To address these problems, we proposed a text-video semantic center alignment network. First, a semantic center alignment module was constructed to promote the alignment of features of the same semantic object across modalities. Second, a pre-trained BERT with a residual structure was designed to preserve spatial information while inferring temporal information. Finally, the "jieba" library was employed to extract the local key information of the text, thereby simplifying local feature extraction. The effectiveness of the network was evaluated on the MSVD, MSR-VTT, and DiDeMo datasets.
External IDs: dblp:journals/spic/JinZZSL26