Keywords: open information extraction, text analytics
TL;DR: An open information extraction corpus and its in-depth analysis
Abstract: Open information extraction (OIE) systems extract relations and their
arguments from natural language text in an unsupervised manner. The resulting
extractions are a valuable resource for downstream tasks such as knowledge
base construction, open question answering, or event schema induction. In this
paper, we release, describe, and analyze an OIE corpus called OPIEC, which was
extracted from the text of English Wikipedia. OPIEC complements the available
OIE resources: it is the largest publicly available OIE corpus to date (over
340M triples) and contains valuable metadata such as provenance information,
confidence scores, linguistic annotations, and semantic annotations including
spatial and temporal information. We analyze the OPIEC corpus by comparing its
content with knowledge bases such as DBpedia or YAGO, which are also based on
Wikipedia. We found that most of the facts between entities in OPIEC
cannot be found in DBpedia and/or YAGO, that OIE facts
often differ from knowledge base facts in their level of specificity, and
that OIE open relations are generally highly polysemous. We believe that the
OPIEC corpus is a valuable resource for future research on automated knowledge
base construction.
Archival Status: Archival
Subject Areas: Information Extraction, Applications: Other
Community Implementations: [4 code implementations](https://www.catalyzex.com/paper/arxiv:1904.12324/code)