Abstract: Joint Multimodal Entity and Relation Extraction (JMERE) aims to extract structured entity-relation quintuplets from text paired with social media images. Large Vision-Language Models (LVLMs) demonstrate impressive performance across various multimodal downstream tasks. However, the complexity of quintuplet extraction and multimodal information fusion places high demands on a model's ability to capture cross-modal associations and perform reasoning, and current LVLMs still perform poorly on the JMERE task.
To address these challenges, we propose JMERE-R1, a novel reasoning-enhanced paradigm for LVLMs. Our method integrates Supervised Fine-Tuning (SFT) with Reinforcement Learning (RL) to guide LVLMs toward autonomous reasoning in multimodal contexts. Furthermore, we employ automatically generated Multimodal Paradigm Chain-of-Thought (MP-CoT) data to encourage the model to focus on image-text interaction information.
Experimental results show that with only parameter-efficient fine-tuning and reinforcement learning, the LVLM develops autonomous multimodal reasoning capabilities. Combined with our policy-guided approach to capturing and associating multimodal information, JMERE-R1 achieves significantly stronger performance on the JMERE task.
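To illustrate the kind of reward signal the RL stage could optimize, the following is a minimal sketch (not the paper's implementation) of a rule-based reward over extracted quintuplets. It assumes the model emits quintuplets as parenthesized "(head, head_type, relation, tail, tail_type)" groups and combines a small format bonus with exact-match F1 against gold quintuplets; the names parse_quintuplets and quintuplet_reward are hypothetical.

```python
import re
from typing import List, Tuple

Quintuplet = Tuple[str, str, str, str, str]

PATTERN = re.compile(r"\(([^()]+)\)")  # capture each "( ... )" group in the output

def parse_quintuplets(text: str) -> List[Quintuplet]:
    """Parse model output into (head, head_type, relation, tail, tail_type) tuples."""
    tuples = []
    for group in PATTERN.findall(text):
        parts = [p.strip() for p in group.split(",")]
        if len(parts) == 5:                      # keep only well-formed quintuplets
            tuples.append(tuple(parts))
    return tuples

def quintuplet_reward(prediction: str, gold: List[Quintuplet]) -> float:
    """Reward = small format bonus + exact-match F1 over quintuplets."""
    pred = parse_quintuplets(prediction)
    format_bonus = 0.1 if pred else 0.0          # reward parseable output at all
    if not pred or not gold:
        return format_bonus
    hits = len(set(pred) & set(gold))
    precision = hits / len(pred)
    recall = hits / len(gold)
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return format_bonus + f1

# Example: one gold quintuplet, one correct and one spurious prediction.
gold = [("LeBron", "PER", "member_of", "Lakers", "ORG")]
output = "(LeBron, PER, member_of, Lakers, ORG) (Lakers, ORG, located_in, LA, LOC)"
print(quintuplet_reward(output, gold))           # 0.1 + F1(=0.667) ≈ 0.77
```

Such a verifiable, rule-based reward is one plausible way to score sampled LVLM completions during policy optimization; the paper's actual reward design may differ.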
Paper Type: Long
Research Area: Information Extraction
Research Area Keywords: Joint multimodal entity-relation extraction
Contribution Types: NLP engineering experiment, Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 3958