A KAN-based lightweight modality fusion method for video-text retrieval

15 Sept 2025 (modified: 15 Jan 2026) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Information Retrieval, modality fusion, state space model
TL;DR: A lightweight modality fusion method for video-text retrieval, a task within video understanding.
Abstract: Unlike text-to-text retrieval, video-text retrieval faces unique challenges due to the inherent modality gap between high-dimensional visual and textual data, which significantly limits model performance. To address this issue, many existing works employ modality fusion techniques to improve accuracy, with the Transformer attention mechanism widely adopted for cross-modal alignment. However, the quadratic computational complexity of attention incurs extremely high memory costs, becoming a major obstacle to efficient training and real-world inference. To tackle these challenges, this paper proposes KFusion, a lightweight yet effective framework for video-text fusion. Specifically, we design a Kolmogorov-Arnold-Network-based Bridge module and a Text-Frame Mamba module. The Bridge leverages learnable spline-based activation functions to capture cross-modal interactions and compute adaptive weights for text and video features. Nevertheless, irrelevant or noisy information often weakens the fusion effect. To mitigate this, the Text-Frame Mamba module introduces separate Mamba backbones, which filter out unimportant signals from the textual and visual embeddings using state space models. The filtered features are then weighted by the Bridge outputs to achieve efficient and robust fusion. Extensive experiments on four benchmark datasets—MSR-VTT, MSVD, ActivityNet, and DiDeMo—demonstrate that KFusion achieves state-of-the-art performance in both accuracy and efficiency.
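To make the described architecture concrete, below is a minimal PyTorch sketch of the fusion scheme outlined in the abstract: per-modality backbones filter the text and frame streams, and a KAN-style Bridge computes adaptive weights for combining them. This is not the authors' implementation; the module names, dimensions, and the simplified RBF-based stand-in for a spline KAN layer are assumptions, and the Mamba backbones are replaced by GRU placeholders purely for self-containment.

```python
# Hypothetical sketch of the KFusion idea (not the authors' code).
import torch
import torch.nn as nn


class SimpleKANLayer(nn.Module):
    """Toy stand-in for a KAN layer: each output is a learnable combination of
    per-input basis functions (an RBF expansion here, instead of B-splines)."""
    def __init__(self, dim_in, dim_out, num_basis=8):
        super().__init__()
        self.centers = nn.Parameter(torch.linspace(-1, 1, num_basis).repeat(dim_in, 1))
        self.log_width = nn.Parameter(torch.zeros(dim_in, num_basis))
        self.coef = nn.Parameter(torch.randn(dim_out, dim_in, num_basis) * 0.02)

    def forward(self, x):                          # x: (..., dim_in)
        d = x.unsqueeze(-1) - self.centers         # (..., dim_in, num_basis)
        phi = torch.exp(-(d / self.log_width.exp()) ** 2)
        return torch.einsum('...ib,oib->...o', phi, self.coef)


class KFusionSketch(nn.Module):
    """Bridge computes adaptive text/video weights; per-modality filters
    (placeholders for the Text-Frame Mamba backbones) clean each stream."""
    def __init__(self, dim=512):
        super().__init__()
        self.bridge = nn.Sequential(SimpleKANLayer(2 * dim, dim), SimpleKANLayer(dim, 2))
        self.text_filter = nn.GRU(dim, dim, batch_first=True)   # placeholder backbone
        self.frame_filter = nn.GRU(dim, dim, batch_first=True)  # placeholder backbone

    def forward(self, text_tokens, frames):        # (B, Lt, D), (B, Lv, D)
        t, _ = self.text_filter(text_tokens)
        v, _ = self.frame_filter(frames)
        t_pooled, v_pooled = t.mean(dim=1), v.mean(dim=1)
        w = torch.softmax(self.bridge(torch.cat([t_pooled, v_pooled], dim=-1)), dim=-1)
        return w[:, :1] * t_pooled + w[:, 1:] * v_pooled       # fused (B, D) embedding
```

The key point the sketch illustrates is that the Bridge produces per-sample weights over the two modalities from their joint representation, so the fused embedding adapts to how informative each filtered stream is, without any quadratic cross-attention.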
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 5495