- Abstract: Similarity measurement plays a central role in many data mining and machine learning tasks. Ideally, a similarity measurement solution should possess three properties: accuracy, efficiency, and independence from prior knowledge. Unfortunately, despite the importance of similarity measurement, no previous work achieves all three. In this paper, we propose X-Forest, a group of approximate Random Projection Trees (RP Trees) that addresses all three goals simultaneously. Our key techniques are as follows. First, we introduce RP Trees into the task of similarity measurement to improve accuracy. Second, we force certain layers in each tree to share identical projection vectors, which improves efficiency. Third, we introduce randomness into the partition step to eliminate its reliance on prior knowledge. Experiments on three real-world datasets show that X-Forest is up to 3.5 times more efficient than RP Trees with a negligible loss of accuracy, and that it outperforms traditional Euclidean distance-based similarity metrics by as much as 20% on clustering tasks. We have released our code anonymously on GitHub for reproducibility.
- Code: https://github.com/X-Forest/Approximate-Random-Projection-Trees
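The three techniques named in the abstract can be illustrated with a minimal sketch. This is not the released X-Forest code. It assumes that "sharing projection vectors" means every node in one layer of a tree reuses a single random vector, that the random partition is a split at a randomly chosen quantile, and that similarity is the fraction of trees in which two points fall into the same leaf. The names `build_tree` and `forest_similarity`, and the 25%-75% quantile range, are hypothetical choices made for illustration.

```python
import numpy as np


def build_tree(X, depth, rng):
    """Return a leaf id for each row of X in one approximate RP tree (illustrative)."""
    n, d = X.shape
    leaf = np.zeros(n, dtype=np.int64)
    for _ in range(depth):
        # Assumption: one random unit vector per layer, shared by all nodes,
        # so the projection is computed once for the whole layer.
        v = rng.standard_normal(d)
        v /= np.linalg.norm(v)
        proj = X @ v
        new_leaf = leaf * 2  # left child id; right child is left + 1
        for node in np.unique(leaf):
            idx = np.flatnonzero(leaf == node)
            p = proj[idx]
            # Assumption: random split point, here a random quantile of the
            # projected values, so no data-dependent prior is needed.
            thr = np.quantile(p, rng.uniform(0.25, 0.75))
            new_leaf[idx[p > thr]] += 1
        leaf = new_leaf
    return leaf


def forest_similarity(X, n_trees=50, depth=4, seed=0):
    """Similarity matrix: fraction of trees placing two points in the same leaf."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    sim = np.zeros((n, n))
    for _ in range(n_trees):
        leaf = build_tree(X, depth, rng)
        sim += leaf[:, None] == leaf[None, :]
    return sim / n_trees
```

Calling `forest_similarity(X)` on an `(n, d)` array returns an `(n, n)` matrix with values in [0, 1]. Under these assumptions, the shared vector reduces the projection work for a layer to one matrix-vector product, which is one plausible reading of where the efficiency gain comes from.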