Abstract: A surface name is the string used to refer to an entity in a text corpus. Crowd-sourced knowledge repositories such as Wikipedia can contain several types of errors, including surface name errors. This paper focuses on identifying and correcting surface name errors in Wikipedia. The problem is important for two reasons. First, Wikipedia is a popular knowledge repository used by millions of people every day. Second, many machine learning models use Wikipedia as training data. Existing work addresses related problems such as Entity Linking, but no existing work deals directly with surface name errors. The intuition behind our work is that infrequent surface names are typically wrong. Using this heuristic, we analyze Wikipedia in eight languages and observe that about 3% to 6% of mentions have surface name errors. We verify the quality of our work in three ways. First, we manually analyze a small sample of predicted surface name errors and update Wikipedia based on our predictions; the Wikipedia community accepted more than 99% of the corrections we suggested. Second, we compare two snapshots of Wikipedia to estimate the quality of our surface name error predictions. Third, we have built a web-based feedback portal where people can give feedback on our predictions.
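The frequency heuristic can be illustrated with a minimal sketch. This is not the paper's implementation: the function name `flag_rare_surface_names`, the `min_share` threshold of 5%, and the toy mention data are all assumptions made for illustration. The sketch counts how often each surface name is used for an entity and flags names whose share of that entity's mentions falls below the threshold, proposing the most frequent name as a candidate correction.

```python
from collections import Counter, defaultdict


def flag_rare_surface_names(mentions, min_share=0.05):
    """Flag surface names that are rarely used for their entity.

    mentions: iterable of (entity, surface_name) pairs.
    min_share: hypothetical cutoff; names below this share of an
        entity's mentions are flagged as suspected errors.
    Returns a dict mapping (entity, rare_surface) to the entity's
    most frequent surface name, offered as a candidate correction.
    """
    counts = defaultdict(Counter)
    for entity, surface in mentions:
        counts[entity][surface] += 1

    flagged = {}
    for entity, surface_counts in counts.items():
        total = sum(surface_counts.values())
        top_surface, _ = surface_counts.most_common(1)[0]
        for surface, n in surface_counts.items():
            if surface != top_surface and n / total < min_share:
                flagged[(entity, surface)] = top_surface
    return flagged


# Toy data (invented for illustration): 50 mentions of one entity.
mentions = (
    [("Barack_Obama", "Barack Obama")] * 40
    + [("Barack_Obama", "Obama")] * 9
    + [("Barack_Obama", "Barak Obama")] * 1
)
suspects = flag_rare_surface_names(mentions)
```

With this toy data, "Obama" accounts for 18% of the mentions and is kept as a legitimate alias, while "Barak Obama" accounts for 2% and is flagged. A real system would likely need more than a fixed threshold, since rare but valid aliases exist.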