Keywords: Multi-Hop Question Answering, Multimodal Reasoning, Contrastive Pretraining, Multimodal Representation
TL;DR: We propose a multimodal multi-hop representation-based contrastive pre-training (MMRCP) approach, which can effectively fuse multimodal and multi-hop question-answering features to enhance the reasoning performance of question-answering tasks.
Abstract: The multimodal multi-hop question-answering (MMQA) task is the most representative multimodal reasoning task, with its primary goal being to perform multi-step logical reasoning based on multimodal questions to obtain accurate answers. Existing MMQA methods based on large language models (LLMs) have made progress; however, they still face challenges in fusing multimodal reasoning features and reasoning over multi-hop questions. To address these issues, we propose a multimodal multi-hop representation-based contrastive pre-training (MMRCP) approach, which can effectively fuse multimodal and multi-hop question-answering features to enhance the reasoning performance of question-answering tasks. It employs two loss functions for contrastive learning training: cross-modal contrastive learning and reasoning-aware contrastive learning, which effectively capture basic multimodal semantic features and question-answering reasoning features. Subsequently, we construct a multi-hop representation fusion module that combines multimodal reasoning features to perform lightweight adaptation for multi-hop question-answering reasoning tasks. Extensive experiments on three real-world multi-hop question-answering datasets demonstrate that MMRCP outperforms multi-hop question-answering baselines by 3% and 4% in precision and error rate, respectively. MMRCP provides a promising direction for future multimodal reasoning tasks.
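The abstract describes a pre-training objective that combines a cross-modal contrastive term with a reasoning-aware contrastive term. The exact loss form is not specified in this summary; as a minimal sketch, assuming both terms are InfoNCE-style losses over paired embeddings and a hypothetical weighting `alpha` (the function names and signatures here are illustrative, not from the paper):

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.07):
    """InfoNCE loss: each row i of `anchors` is matched to row i of `positives`."""
    # Normalize embeddings so similarity is cosine similarity
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature               # pairwise similarity matrix
    # Cross-entropy with the diagonal (matching pairs) as the positive class
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    idx = np.arange(len(a))
    return -log_probs[idx, idx].mean()

def mmrcp_pretraining_loss(img_emb, txt_emb, q_emb, ans_emb, alpha=0.5):
    """Hypothetical combination of the two contrastive terms named in the abstract."""
    # Cross-modal term: align image and text views of the same instance
    l_cross_modal = info_nce(img_emb, txt_emb)
    # Reasoning-aware term: align question representations with their answers
    l_reasoning = info_nce(q_emb, ans_emb)
    return alpha * l_cross_modal + (1 - alpha) * l_reasoning
```

With matched pairs (e.g., `img_emb` close to `txt_emb` row-for-row) the combined loss should be lower than for randomly paired embeddings, which is the behavior a contrastive pre-training stage relies on.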
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 9471