Large-scale author coreference via hierarchical entity representations

Michael Wick, Ari Kobren, Andrew McCallum

May 08, 2013 (modified: May 08, 2013) ICML 2013 PeerReview submission readers: everyone
  • Decision: oral
  • Abstract: Large-scale author coreference, the problem of ascribing research papers to real-world authors in bibliographic databases, is critical for mining the scientific community. However, traditional pairwise approaches, which measure coreference similarity between pairs of author mentions, scale poorly to large databases, and streaming approaches, which cannot retroactively correct errors, can suffer from chronically low accuracy. In this paper we present a hierarchical model for author coreference that overcomes these issues. First, our model enables scalability over rich entity representations by compactly organizing the mentions of each author into trees. Second, we employ Markov chain Monte Carlo (MCMC) inference, which can retroactively correct existing coreference errors when processing new mentions. We validate these two properties empirically and demonstrate further scalability through asynchronous parallel MCMC, which allows us to scale to all 150,000,000 author mentions in Web of Science.
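
The two ideas in the abstract (summarizing each author's mentions in a tree, and using MCMC so that earlier decisions can be revised) can be illustrated with a toy sketch. This is not the authors' model. It assumes a one-level tree per entity, in which a root bag of features summarizes the leaf mentions; a hypothetical cosine-based score with an arbitrary 0.5 threshold; and a simplified Metropolis-Hastings-style move that treats the proposal as symmetric. The names `cosine`, `entity_score`, and `sample` are invented for illustration.

```python
import math
import random
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse feature bags."""
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def entity_score(mentions, members):
    """Hypothetical score: agreement of each leaf with the rest of its root summary."""
    if len(members) < 2:
        return 0.0
    root = Counter()  # root node: aggregated features of all leaf mentions
    for m in members:
        root.update(mentions[m])
    total = 0.0
    for m in members:
        rest = root - mentions[m]  # root summary without this mention
        total += cosine(mentions[m], rest) - 0.5  # 0.5 is an assumed threshold
    return total


def sample(mentions, steps=2000, temperature=0.1, seed=0):
    """Move single mentions between entities, accepting moves MH-style."""
    rng = random.Random(seed)
    n = len(mentions)
    assign = list(range(n))  # start with every mention in its own entity
    for _ in range(steps):
        m = rng.randrange(n)
        src = assign[m]
        if rng.random() < 0.2:
            dst = max(assign) + 1  # propose a fresh entity (allows splits)
        else:
            dst = assign[rng.randrange(n)]  # entity of a random mention
        if dst == src:
            continue
        src_members = [i for i, e in enumerate(assign) if e == src]
        dst_members = [i for i, e in enumerate(assign) if e == dst]
        # Only the two affected entities need rescoring.
        old = entity_score(mentions, src_members) + entity_score(mentions, dst_members)
        new = (entity_score(mentions, [i for i in src_members if i != m])
               + entity_score(mentions, dst_members + [m]))
        # Simplification: the proposal is treated as symmetric.
        if new >= old or rng.random() < math.exp((new - old) / temperature):
            assign[m] = dst
    return assign


if __name__ == "__main__":
    toy = [Counter(a=1, b=1)] * 3 + [Counter(x=1, y=1)] * 3
    print(sample(toy, temperature=0.01))
```

Because any mention can be moved again at a later step, an early mistaken merge is not permanent, which is the property the abstract contrasts with streaming approaches. Each proposal rescores only the two entities it touches, a rough analogue of the locality that compact entity summaries are meant to provide.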