Point and Line: Multilingual Mutual Reinforcement Effect Mix Information Extraction Datasets

ACL ARR 2025 February Submission 585 Authors

09 Feb 2025 (modified: 09 May 2025) · License: CC BY 4.0
Abstract:

The Mutual Reinforcement Effect (MRE) represents a promising avenue in information extraction and multitasking research. Nevertheless, its applicability has been constrained because MRE mix datasets have been available exclusively in Japanese, limiting comprehensive exploration by the global research community. To address this limitation, we introduce a Multilingual MRE mix dataset (MMM) that encompasses 21 sub-datasets in English, Japanese, and Chinese. We also propose a method for dataset translation assisted by Large Language Models (LLMs), which significantly reduces the manual annotation time required for dataset construction by using LLMs to translate the original Japanese datasets. Additionally, we enrich the dataset by incorporating open-domain Named Entity Recognition (NER) and sentence classification tasks. Using this expanded dataset, we develop a unified input-output framework to train an Open-domain Information Extraction Large Language Model (OIELLM). OIELLM effectively processes the new MMM datasets and achieves significant performance improvements. Furthermore, we conduct an ablation study evaluating MRE across all 21 MMM sub-datasets; 76% of them exhibit the effect, supporting its robustness. Finally, we apply the MRE mix datasets to construct a knowledgeable verbalizer (KV), and the KV built from these datasets achieves superior performance, further validating the effectiveness of MRE in enhancing information extraction subtasks.
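As an illustration of the unified input-output framework mentioned above, the minimal Python sketch below shows what a joint word-level (NER) and sentence-level (classification) training pair could look like when serialized into a single sequence. The field names, task tag, and label format are illustrative assumptions, not the released MMM schema or the authors' code.

```python
# Minimal sketch of a unified input-output training pair for an MRE mix
# task: one input sequence carries the task tag plus the source text, and
# one output sequence carries both the word-level extractions and the
# sentence-level label. All names below are hypothetical placeholders.
sample = {
    "input": "task: NER+classification | text: Taro Yamada joined "
             "Kyoto University in 2020.",
    "output": "entities: [Taro Yamada: person; Kyoto University: "
              "organization] | label: factual",
}

def to_prompt(pair: dict) -> str:
    """Serialize one pair into a single prompt/target training string."""
    return f"{pair['input']}\n### Response:\n{pair['output']}"

if __name__ == "__main__":
    # Prints the serialized sequence a decoder-only LLM would be trained on.
    print(to_prompt(sample))
```

Formatting both task levels into one target string lets a single model learn the word-level and sentence-level tasks jointly, which is the setting in which MRE is measured.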

Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: data influence, open information extraction, language resources
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Publicly available software and/or pre-trained models, Data resources, Data analysis
Languages Studied: English, Chinese, Japanese
Submission Number: 585