Abstract: As the amount of data increases rapidly, improving the scalability of nonlinear clustering has become a crucial and challenging problem. In this paper, we design an efficient parallel nonlinear clustering algorithm using a four-stage MapReduce framework. Our approach requires two quantities based on distance matrices, which are difficult to compute in a MapReduce framework. To address this issue, we propose processing the data in a streaming manner to compute the distances between points while ensuring that the output of the original nonlinear clustering algorithm is unchanged. Our algorithm computes the distances between points in parallel and uses these distances to compute the density and the min-distances, from which we determine the cluster centers and thereby discover nonlinear clusters. Extensive experiments have been conducted to demonstrate the efficiency of the proposed approach.
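To make the two distance-based quantities concrete, the following is a minimal single-machine sketch, not the four-stage MapReduce algorithm itself. It assumes a density-peaks-style formulation that the abstract does not spell out: the density of a point is the number of other points within a cutoff radius, and its min-distance is the distance to the nearest point of strictly higher density. The function names, the `cutoff` parameter, and the rule of ranking candidate centers by density times min-distance are illustrative assumptions. Distances are recomputed pair by pair rather than stored, loosely mirroring the streaming idea of never materializing the full distance matrix.

```python
import math


def density_and_min_distance(points, cutoff):
    """Return (density, min_dist) lists for `points`.

    Assumed definitions (not taken from the abstract):
      density[i]  = number of other points closer than `cutoff`
      min_dist[i] = distance to the nearest point with strictly higher
                    density; points with no denser neighbor get the
                    distance to their farthest point instead.
    """
    n = len(points)

    # Pass 1: count neighbors within the cutoff, one pair at a time.
    density = [0] * n
    for i in range(n):
        for j in range(n):
            if i != j and math.dist(points[i], points[j]) < cutoff:
                density[i] += 1

    # Pass 2: nearest denser point, again without storing the matrix.
    min_dist = [0.0] * n
    for i in range(n):
        nearest_denser = math.inf
        farthest = 0.0
        for j in range(n):
            if i == j:
                continue
            d = math.dist(points[i], points[j])
            farthest = max(farthest, d)
            if density[j] > density[i] and d < nearest_denser:
                nearest_denser = d
        min_dist[i] = nearest_denser if nearest_denser < math.inf else farthest

    return density, min_dist


def pick_centers(density, min_dist, k):
    """Pick k candidate centers by the assumed score density * min_dist."""
    score = [rho * delta for rho, delta in zip(density, min_dist)]
    return sorted(range(len(score)), key=lambda i: score[i], reverse=True)[:k]
```

In this sketch each pass touches every pair of points independently, so the inner loops are the part that a parallel implementation would distribute; how the paper actually partitions that work across its four stages is not stated in the abstract.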