- Keywords: Resource construction, Structured Wikipedia
- Abstract: We report on the SHINRA project, which structures Wikipedia through a collaborative construction scheme. The goal of the project is to create a large, well-structured knowledge base for use in NLP applications such as QA, dialogue systems, and explainable NLP systems. It is built on a scheme we call “Resource by Collaborative Contribution (RbCC)”: we conducted a shared task on structuring Wikipedia and, at the same time, used the submitted results to construct a knowledge base. Machine-readable knowledge bases such as CYC, DBpedia, YAGO, Freebase, and Wikidata already exist, but each has unsolved problems. CYC has a coverage problem, and the others have a coherence problem because they are based on Wikipedia and/or created by many, inherently incoherent, crowd workers. To address the latter problem, we started a project to structure Wikipedia through an automatic knowledge base construction shared task. Such shared tasks have been popular and well studied for decades. However, they are designed only to compare the performance of different systems and to find which system ranks best on limited test data. The outputs of the participating systems are not shared, and the systems may be abandoned once the task is over. We believe this situation can be improved by the following changes: (1) designing the shared task to construct a knowledge base rather than only evaluating on limited test data; (2) making the outputs of all systems publicly available so that ensemble learning can produce better results than the best single system; and (3) repeating the task so that each round can use larger and better training data derived from the output of the previous round (bootstrapping and active learning). We conducted “SHINRA2018” under this scheme, and in this paper we report its results and the future directions of the project.
The task is to extract the values of pre-defined attributes from Wikipedia pages. We have categorized most of the entities in Japanese Wikipedia (namely, 730 thousand entities) into the 200 ENE categories, and the shared task builds on this categorization. We provided 600 training examples, and participants were required to submit attribute values for all remaining entities of the same category. For each category, 100 of these entities were then used to evaluate the system outputs. We conducted preliminary ensemble learning on the outputs and, relative to a strong baseline, found an improvement of 15 F1 points on one category and an average improvement of 8 F1 points across all 5 categories we tested. Based on these promising results, we decided to conduct three tasks in 2019: a multilingual categorization task (ML), extraction for the same 5 categories in Japanese with larger training data (JP-5), and extraction for 34 new categories in Japanese (JP-34).
- Archival status: Non-Archival
- Subject areas: Natural Language Processing, Information Extraction, Information Integration, Crowd-sourcing, Other
- TL;DR: We introduce a "Resource by Collaborative Contribution" scheme to create a KB: a structured Wikipedia
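
The abstract reports gains from ensemble learning over the submitted system outputs but does not say which ensemble method was used. As a rough illustration only, the sketch below assumes the simplest option, majority voting over (entity, attribute, value) triples, and scores the result with F1 against a gold set. The function names, the `min_votes` parameter, and the toy data are all hypothetical and are not taken from the SHINRA project.

```python
from collections import Counter


def vote_ensemble(system_outputs, min_votes):
    """Keep each (entity, attribute, value) triple proposed by at least
    `min_votes` systems. Simple voting is an assumption of this sketch,
    not necessarily the method used in the paper."""
    counts = Counter()
    for triples in system_outputs:
        counts.update(set(triples))  # each system votes at most once per triple
    return {triple for triple, votes in counts.items() if votes >= min_votes}


def f1_score(predicted, gold):
    """F1 over sets of extracted triples."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy data: three systems extracting attributes for one entity.
NRT = ("Narita Airport", "IATA code", "NRT")
JAPAN = ("Narita Airport", "country", "Japan")
WRONG = ("Narita Airport", "country", "China")

systems = [{NRT, JAPAN}, {NRT, WRONG}, {NRT, JAPAN}]
gold = {NRT, JAPAN}

ensemble = vote_ensemble(systems, min_votes=2)
print(f1_score(systems[1], gold))  # F1 of the weakest single system
print(f1_score(ensemble, gold))    # F1 of the voted ensemble
```

In this toy case the vote discards the one value proposed by only a single system, which is the kind of effect that makes publicly shared outputs from all participants useful.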