Abstract: The rapid growth of the Internet and big data has led to large-scale multimodal data, posing challenges for traditional retrieval methods. These methods typically rely on a two-stage retrieve-then-rerank architecture, which struggles to bridge the semantic gap between the visual and textual modalities. This limitation hampers information fusion and reduces both the accuracy and the efficiency of cross-modal retrieval. To overcome these challenges, we propose FusionRM, a language-guided cross-modal semantic fusion retrieval method. FusionRM exploits the expressive power of textual semantics to bridge the knowledge gap between the visual and linguistic modalities. By combining implicit visual knowledge with explicit textual knowledge, FusionRM constructs a unified embedding space that aligns semantics across modalities, improving both the accuracy and the efficiency of multimodal retrieval. Experiments on the multi-hop, multimodal WebQA dataset show that FusionRM outperforms traditional methods across multiple metrics, demonstrating strong performance and generalization in open-domain retrieval.
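To make the core idea concrete, the following is a minimal sketch of language-guided fusion into a unified embedding space: a visual embedding (implicit visual knowledge) and a caption embedding (explicit textual knowledge) are combined into one vector that a text query can be matched against directly. The encoder outputs, the embedding dimensionality, and the simple averaging fusion are assumptions for illustration only; the abstract does not specify FusionRM's actual fusion module.

```python
# Illustrative sketch only (not the authors' implementation).
# Assumptions: EMB_DIM, the fuse() operator, and the random stand-ins for
# encoder outputs are hypothetical; FusionRM's real fusion is unspecified here.
import torch
import torch.nn.functional as F

EMB_DIM = 512  # assumed shared embedding dimensionality


def fuse(visual_emb: torch.Tensor, caption_emb: torch.Tensor) -> torch.Tensor:
    """Combine visual and textual embeddings into a unified, normalized vector.

    Here fusion is a plain average; the actual method may use a learned module.
    """
    fused = 0.5 * visual_emb + 0.5 * caption_emb
    return F.normalize(fused, dim=-1)


def retrieve(query_emb: torch.Tensor, doc_embs: torch.Tensor, k: int = 5) -> torch.Tensor:
    """Rank fused document embeddings by cosine similarity to a text query."""
    scores = F.normalize(query_emb, dim=-1) @ doc_embs.T
    return scores.topk(k, dim=-1).indices


# Toy usage with random tensors standing in for encoder outputs.
visual = torch.randn(100, EMB_DIM)   # implicit visual knowledge (image encoder output)
caption = torch.randn(100, EMB_DIM)  # explicit textual knowledge (e.g., caption encoder output)
docs = fuse(visual, caption)
query = torch.randn(1, EMB_DIM)      # encoded text query
print(retrieve(query, docs, k=5))
```

In this single-space setup, retrieval reduces to one nearest-neighbor search over fused embeddings rather than a separate retrieval stage followed by cross-modal reranking.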