Abstract: The name disambiguation task partitions a collection of records pertaining to a given name so that there is a one-to-one correspondence between the partitions and the people who share that name. Most existing solutions for this task are designed for static data. In more realistic scenarios, however, records arrive in a streaming fashion and may belong to known as well as unknown persons, all sharing the same name. This requires a flexible name disambiguation algorithm that can not only classify records of known persons, who are represented in the training data by their existing records, but also identify records of new ambiguous persons who have no records in the initial training dataset. Toward this objective, we present a non-parametric Bayesian framework that uses a Dirichlet Process Gaussian Mixture Model (DPGMM) as the core engine for the online name disambiguation task. A Sequential Importance Sampling with Resampling (SISR) technique, also known as particle filtering, is proposed for inference to simultaneously perform online classification and new class discovery. Specifically, for each online record, we approximate its class conditional posterior distribution by a set of particles and their weights, which are updated sequentially without the need to re-access previously observed records. We also propose an interactive version of our online name disambiguation method, which improves prediction accuracy by exploiting user feedback.
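
To make the sequential update concrete, the following is a minimal sketch of one particle-filtering step for a Dirichlet process mixture. It is not the paper's implementation. It assumes a simplified model (spherical Gaussian clusters with known variance `sigma2` and a zero-mean Normal prior with variance `tau2` on each cluster mean), and every name in it (`Particle`, `log_predictive`, `step`, `alpha`) is hypothetical. Each particle keeps only per-cluster counts and sums, so earlier records are never revisited, and a record can either join an existing cluster or open a new one.

```python
import numpy as np

class Particle:
    """One hypothesis about how the records seen so far split into clusters."""
    def __init__(self):
        self.counts = []  # number of records assigned to each cluster
        self.sums = []    # per-cluster sum of feature vectors

    def copy(self):
        p = Particle()
        p.counts = list(self.counts)
        p.sums = [s.copy() for s in self.sums]
        return p

def log_predictive(x, n, s, sigma2, tau2):
    """Log posterior-predictive density of x for a cluster holding n records with sum s.
    Assumed model: x ~ N(mu, sigma2 * I), mu ~ N(0, tau2 * I)."""
    post_var = 1.0 / (1.0 / tau2 + n / sigma2)
    post_mean = post_var * s / sigma2
    var = post_var + sigma2
    d = x.shape[0]
    return -0.5 * (d * np.log(2 * np.pi * var) + np.sum((x - post_mean) ** 2) / var)

def step(particles, log_w, x, alpha, sigma2, tau2, rng):
    """Process one streaming record x: reweight, sample an assignment, maybe resample."""
    for i, p in enumerate(particles):
        n_total = sum(p.counts)
        # Chinese-restaurant-process prior times predictive likelihood:
        # one term per existing cluster, plus a final term for a brand-new cluster.
        logs = [np.log(c) + log_predictive(x, c, s, sigma2, tau2)
                for c, s in zip(p.counts, p.sums)]
        logs.append(np.log(alpha)
                    + log_predictive(x, 0, np.zeros_like(x), sigma2, tau2))
        logs = np.array(logs) - np.log(n_total + alpha)
        m = logs.max()
        log_marg = m + np.log(np.exp(logs - m).sum())
        log_w[i] += log_marg  # importance weight: marginal likelihood of x
        probs = np.exp(logs - log_marg)
        k = rng.choice(len(probs), p=probs / probs.sum())
        if k == len(p.counts):       # new class discovered
            p.counts.append(1)
            p.sums.append(x.copy())
        else:                        # record classified into a known class
            p.counts[k] += 1
            p.sums[k] = p.sums[k] + x
    # Normalise the weights in log space.
    m = log_w.max()
    log_w = log_w - (m + np.log(np.exp(log_w - m).sum()))
    w = np.exp(log_w)
    w = w / w.sum()
    # Resample when the effective sample size drops below half the particle count.
    if 1.0 / np.sum(w ** 2) < 0.5 * len(particles):
        idx = rng.choice(len(particles), size=len(particles), p=w)
        particles = [particles[j].copy() for j in idx]
        log_w = np.full(len(particles), -np.log(len(particles)))
    return particles, log_w
```

A caller would start with a list of empty `Particle` objects and uniform log-weights, then call `step` once per incoming record. The half-of-N resampling threshold is a common heuristic chosen here for illustration, not a value taken from the paper.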