Abstract: Joint Multimodal Entity and Relation Extraction (JMERE) aims to extract structured entity-relation quintuplets from text paired with social media images. Large Vision-Language Models (LVLMs) demonstrate impressive performance across various multimodal downstream tasks. However, the complexity of quintuplet extraction and multimodal information fusion places high demands on a model's ability to capture cross-modal associations and perform reasoning, and current LVLMs still perform poorly on the JMERE task.
To address these challenges, we propose JMERE-R1, a novel reasoning-enhanced paradigm for LVLMs. Our method integrates Supervised Fine-Tuning (SFT) with Reinforcement Learning (RL) to guide LVLMs toward autonomous reasoning in multimodal contexts. Furthermore, we employ automatically generated Multimodal Paradigm Chain-of-Thought (MP-CoT) data to encourage the model to focus on image-text interaction information.
Experimental results show that with only parameter-efficient fine-tuning and reinforcement learning, the LVLM develops autonomous multimodal reasoning capabilities. Combined with our policy-guided approach to capturing and associating multimodal information, JMERE-R1 achieves significantly stronger performance on the JMERE task.
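To illustrate the kind of reward signal the RL stage could optimize, the following is a minimal sketch (not the paper's implementation) of a rule-based reward over extracted quintuplets. It assumes the model emits quintuplets as parenthesized "(head, head_type, relation, tail, tail_type)" groups and combines a small format bonus with exact-match F1 against gold quintuplets; the names parse_quintuplets and quintuplet_reward are hypothetical.

```python
import re
from typing import List, Tuple

Quintuplet = Tuple[str, str, str, str, str]

PATTERN = re.compile(r"\(([^()]+)\)")  # capture each "( ... )" group in the output

def parse_quintuplets(text: str) -> List[Quintuplet]:
    """Parse model output into (head, head_type, relation, tail, tail_type) tuples."""
    tuples = []
    for group in PATTERN.findall(text):
        parts = [p.strip() for p in group.split(",")]
        if len(parts) == 5:                      # keep only well-formed quintuplets
            tuples.append(tuple(parts))
    return tuples

def quintuplet_reward(prediction: str, gold: List[Quintuplet]) -> float:
    """Reward = small format bonus + exact-match F1 over quintuplets."""
    pred = parse_quintuplets(prediction)
    format_bonus = 0.1 if pred else 0.0          # reward parseable output at all
    if not pred or not gold:
        return format_bonus
    hits = len(set(pred) & set(gold))
    precision = hits / len(pred)
    recall = hits / len(gold)
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return format_bonus + f1

# Example: one gold quintuplet, one correct and one spurious prediction.
gold = [("LeBron", "PER", "member_of", "Lakers", "ORG")]
output = "(LeBron, PER, member_of, Lakers, ORG) (Lakers, ORG, located_in, LA, LOC)"
print(quintuplet_reward(output, gold))           # 0.1 + F1(=0.667) ≈ 0.77
```

Such a verifiable, rule-based reward is one plausible way to score sampled LVLM completions during policy optimization; the paper's actual reward design may differ.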
Paper Type: Long
Research Area: Information Extraction
Research Area Keywords: Joint multimodal entity-relation extraction
Contribution Types: NLP engineering experiment, Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 3958