Modality-Balanced Decoupling Alignment for Text-Video Retrieval

Feng Wang; Ruyang Liu; Xinpeng Liu; Shiqiang Long; Ge Li

20 Sept 2025 (modified: 14 Nov 2025) | ICLR 2026 Conference Withdrawn Submission | CC BY 4.0

Keywords: Video Understanding, Text-Video Retrieval, Modality-Balanced Decoupling Alignment

TL;DR: In text-video retrieval, we propose Modality-Balanced Decoupling Alignment to address the challenge posed by the imbalance in multimodal representation space.

Abstract: Text-video retrieval, the task of retrieving videos given a text query or vice versa, plays a significant role in video understanding. A major challenge in this task is the semantic gap between video and text, primarily caused by the disparity in information capacity and the highly coupled nature of video information. Existing alignment methods mainly focus on multi-grained alignment between videos and text, which fails to address the capacity imbalance between the video and text feature spaces. To address these issues, we propose Modality-Balanced Decoupling Alignment (MBDA), a novel method that aligns the two modalities with closer distributions and more balanced information capacity in the feature space. Specifically, our model consists of two modules. The Modality Proximity Alignment module brings the video embedding closer to the text embedding, while the Video Representation Orthogonal Decoupling module separates the aligned video embedding into two orthogonal components, achieving better balance with their textual counterparts. Furthermore, we demonstrate that our decoupling approach achieves orthogonality while eliminating information redundancy among components through low-rank decomposition and frequency-domain analysis via the Discrete Fourier Transform. The proposed method improves the baseline by a large margin. Extensive experiments demonstrate that MBDA achieves state-of-the-art performance on four of the most widely used public benchmarks: MSR-VTT (52.4%), DiDeMo (53.1%), MSVD (54.0%), and ActivityNet (49.6%).
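
The abstract does not say how the orthogonal decoupling is computed. Purely to illustrate what "separating an embedding into two orthogonal components" means, here is a minimal sketch that assumes a plain vector projection onto a text embedding. This is not the authors' module; the function names (`dot`, `decouple`) and the toy vectors are hypothetical.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def decouple(video, text):
    """Split `video` into a part parallel to `text` and an orthogonal residual.

    Assumption: a simple Gram-Schmidt-style projection, used here only as a
    stand-in for the (unspecified) decoupling module described in the abstract.
    """
    scale = dot(video, text) / dot(text, text)
    parallel = [scale * t for t in text]                 # component along the text direction
    residual = [v - p for v, p in zip(video, parallel)]  # what is left over
    return parallel, residual


# Toy 4-dimensional embeddings (made-up values, for illustration only).
video_emb = [0.8, 0.1, 0.5, -0.3]
text_emb = [1.0, 0.0, 1.0, 0.0]

parallel, residual = decouple(video_emb, text_emb)

# The two components are orthogonal (inner product is zero up to rounding)
# and together reconstruct the original video embedding.
print(abs(dot(parallel, residual)) < 1e-9)
```

In this toy setting the two components carry non-overlapping information about the original vector, which is the property the abstract attributes to the decoupled video representation.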

Supplementary Material: zip

Primary Area: applications to computer vision, audio, language, and other modalities

Submission Number: 22750
