Multimodal Chemical Structure-Text Coreference in Intellectual Property via Rule-guided Reinforcement Learning

ACL ARR 2026 January Submission3371 Authors

04 Jan 2026 (modified: 20 Mar 2026), ACL ARR 2026 January Submission, CC BY 4.0

Keywords: Chemical Structure-Text Coreference, Multimodal Reinforcement Learning

Abstract: Navigating biopharmaceutical intellectual property requires precisely associating visual chemical structures with their textual referents across lengthy documents. Despite its critical role in drug discovery, this multimodal coreference task remains underexplored. It presents unique challenges, including handling Markush structures and distinguishing atom-level differences between adjacent structures. To bridge this gap, we define the multimodal **Che**mical **S**tructure-**T**ext coreference task and introduce **CheST**, the first dataset explicitly designed for it. Furthermore, to satisfy the strict logical consistency the task demands, we propose **RULER**, a **RULE**-guided multimodal **R**einforcement learning framework built upon an SFT cold start. RULER uses rule-driven reward functions that operationalize multidimensional consistency constraints and act as a domain-specific "verifier" so that the model acquires correct domain knowledge. Experimental results show that RULER achieves a 40% improvement over the strongest baseline, Gemini-2.5-Pro, demonstrating its efficacy.

Paper Type: Long

Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond

Research Area Keywords: cross-modal information extraction, cross-modal application, multimodality

Languages Studied: English

Submission Number: 3371
