Centering Similarity Measures to Reduce Hubs

Ikumi Suzuki, Kazuo Hara, Masashi Shimbo, Marco Saerens, Kenji Fukumizu

2013 (modified: 16 Jul 2019) | EMNLP 2013 | Readers: Everyone

Abstract: The performance of nearest neighbor methods is degraded by the presence of hubs, i.e., objects in the dataset that are similar to many other objects. In this paper, we show that the classical method of centering, the transformation that shifts the origin of the space to the data centroid, provides an effective way to reduce hubs. We show analytically why hubs emerge and why they are suppressed by centering, under a simple probabilistic model of data. To further reduce hubs, we also move the origin more aggressively towards hubs, through weighted centering. Our experimental results show that (weighted) centering is effective for natural language data; it improves the performance of the k-nearest neighbor classifiers considerably in word sense disambiguation and document classification tasks.
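As a rough illustration of the centering idea described in the abstract, the following sketch subtracts the data centroid from every vector and then counts how often each object appears in the k-nearest-neighbor lists of the others (a common way to spot hubs). It is not taken from the paper: the inner-product similarity, the toy data, and the function names `centroid`, `center`, and `k_occurrences` are all assumptions made for this example, and weighted centering is not covered.

```python
# Illustrative sketch only; names and data are hypothetical, not from the paper.

def centroid(vectors):
    """Return the component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    dim = len(vectors[0])
    return [sum(v[j] for v in vectors) / n for j in range(dim)]


def center(vectors):
    """Shift the origin to the data centroid by subtracting it from each vector."""
    c = centroid(vectors)
    return [[x - cj for x, cj in zip(v, c)] for v in vectors]


def inner(u, v):
    """Inner-product similarity (assumed here as the similarity measure)."""
    return sum(a * b for a, b in zip(u, v))


def k_occurrences(vectors, k):
    """Count how many times each object appears in the k-NN lists of the others.

    Objects with a very large count behave like hubs.
    """
    n = len(vectors)
    counts = [0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        # Most similar first; ties keep their original index order.
        others.sort(key=lambda j: inner(vectors[i], vectors[j]), reverse=True)
        for j in others[:k]:
            counts[j] += 1
    return counts


if __name__ == "__main__":
    # Toy data: the last vector has a large norm, so it dominates
    # inner-product similarity before centering.
    data = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [3.0, 3.0]]
    print("k-occurrences before centering:", k_occurrences(data, k=1))
    print("k-occurrences after centering: ", k_occurrences(center(data), k=1))
```

On this toy data the largest k-occurrence count drops after centering, which matches the qualitative claim of the abstract; it says nothing about how large the effect is on real language data.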
