Keywords: Fact Linking, Multilinguality, KGs, Dual-Cross Encoders, Retrieval+Generation
TL;DR: We introduce Multilingual Fact Linking (MFL), a new task of linking KG facts whose labels are in one language to text in another language. We present a new dataset, IndicLink, and demonstrate the effectiveness of a Retrieval+Generation model for the MFL task.
Abstract: Knowledge-intensive NLP tasks can benefit from linking natural language text with facts from a Knowledge Graph (KG). Although facts themselves are language-agnostic, the fact labels (i.e., the language-specific representations of a fact) in the KG are often available in only a few languages. This makes it challenging to link KG facts to sentences in languages outside that limited set. To address this problem, we introduce the task of Multilingual Fact Linking (MFL), where the goal is to link a fact expressed in a sentence to the corresponding fact in the KG, even when the fact label in the KG is not available in the language of the sentence. We additionally consider cases where the sentence expresses the fact only partially. To facilitate research in this area, we present a new evaluation dataset, IndicLink. This dataset contains 11,293 linked WikiData facts and 6,429 sentences spanning English and six Indian languages. We propose a Retrieval+Generation model, ReFCoG, that can scale to millions of KG facts by combining Dual Encoder based retrieval with a Seq2Seq based generation model that is constrained to output only valid KG facts. ReFCoG outperforms standard Retrieval+Re-ranking models by 10.7 pts in Precision@1. In spite of this gain, the model achieves an overall score of 52.1, showing ample scope for improvement in the task.
Subject Areas: Knowledge Representation, Semantic Web and Search, Information Extraction
Archival Status: Archival
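
Illustrative sketch: The abstract describes a generation model that is constrained to output only valid KG facts from a retrieved candidate set. One common way to enforce such a constraint is prefix-trie decoding, sketched below. This is an assumption for illustration, not a description of ReFCoG's actual mechanism: the trie, the greedy search, the toy overlap scorer, the example facts, and all function names are hypothetical stand-ins for a real retriever and Seq2Seq model.

```python
from typing import Callable, Dict, List, Sequence

END = "<eos>"  # hypothetical end-of-fact marker


def build_trie(facts: Sequence[Sequence[str]]) -> Dict:
    """Store each candidate fact (a token sequence) as a path in a nested dict."""
    trie: Dict = {}
    for fact in facts:
        node = trie
        for token in fact:
            node = node.setdefault(token, {})
        node[END] = {}  # mark that a complete fact ends here
    return trie


def constrained_greedy_decode(
    trie: Dict, score_fn: Callable[[List[str], str], float]
) -> List[str]:
    """Greedy decoding that may only follow paths present in the trie."""
    if not trie:
        raise ValueError("no candidate facts to decode over")
    prefix: List[str] = []
    node = trie
    while True:
        # Only tokens that extend the prefix toward a valid fact are allowed.
        best = max(node.keys(), key=lambda tok: score_fn(prefix, tok))
        if best == END:
            return prefix
        prefix.append(best)
        node = node[best]


# Hypothetical candidates, standing in for the output of a retrieval step.
candidates = [
    ["Taj_Mahal", "located_in", "Agra"],
    ["Taj_Mahal", "located_in", "India"],
    ["Taj_Mahal", "architect", "Ustad_Ahmad_Lahauri"],
]

# Toy scorer standing in for a Seq2Seq model: reward tokens that the
# (already aligned) input sentence mentions.
sentence_tokens = {"Taj_Mahal", "located_in", "Agra"}


def overlap_score(prefix: List[str], token: str) -> float:
    if token == END:
        return 0.5
    return 1.0 if token in sentence_tokens else 0.0


linked = constrained_greedy_decode(build_trie(candidates), overlap_score)
print(linked)
```

Because every decoding step is restricted to the children of the current trie node, the output is always one of the candidate facts, whatever the scorer prefers. In a real system, a beam search over model log-probabilities would likely replace the greedy loop and the toy scorer.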