Abstract: The Evidential K-Nearest Neighbor (EK-NN) classifier remains impractical for big data because of its time and memory requirements. In both the training and testing stages, the search for the K nearest neighbors has quadratic computational cost and must be repeated for every input sample. To address this issue, we propose a distributed EK-NN classifier, named Global Exact EK-NN, for fast processing with Apache Spark. We compare the proposed classifier, which scales to 48 nodes (2688 cores) on the Frontera cluster at the Texas Advanced Computing Center, with several other parallel K-NN-based algorithms on four large datasets. Our method achieves state-of-the-art scaling efficiency and accuracy on large datasets with more than 10 million samples.
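
For orientation, the quadratic cost mentioned above can be seen in a minimal single-machine sketch of a brute-force evidential K-NN decision. This is not the distributed Global Exact EK-NN method; the function name `eknn_predict`, the single shared `gamma` (the standard EK-NN rule allows one per class), and the default values of `alpha` and `gamma` are assumptions made only for illustration.

```python
import math
from collections import defaultdict


def eknn_predict(train_X, train_y, x, k=3, alpha=0.95, gamma=1.0):
    """Classify x with a brute-force evidential K-NN rule (illustrative sketch).

    Assumes 0 < alpha < 1 so that no neighbor removes all mass from the frame.
    """
    # Brute-force search: one squared distance per training sample, so
    # classifying n queries against n samples needs O(n^2) distance evaluations.
    scored = [
        (sum((a - b) ** 2 for a, b in zip(row, x)), label)
        for row, label in zip(train_X, train_y)
    ]
    neighbors = sorted(scored, key=lambda pair: pair[0])[:k]

    # Each neighbor puts mass alpha * exp(-gamma * d^2) on its own class and
    # the rest on the whole frame. Per class, track the product of the
    # frame masses of its neighbors.
    frame_mass = defaultdict(lambda: 1.0)
    for d2, label in neighbors:
        frame_mass[label] *= 1.0 - alpha * math.exp(-gamma * d2)

    # Dempster's rule for masses focused on singletons and the frame:
    # the unnormalized mass of class q is (1 - frame_q) * prod_{r != q} frame_r.
    total_frame = math.prod(frame_mass.values())
    masses = {
        label: (1.0 - fm) * total_frame / fm for label, fm in frame_mass.items()
    }
    norm = sum(masses.values()) + total_frame
    masses = {label: m / norm for label, m in masses.items()}

    # Decide for the class with the largest combined mass.
    return max(masses, key=masses.get), masses


if __name__ == "__main__":
    X = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
    y = [0, 0, 1, 1]
    label, masses = eknn_predict(X, y, [0.2, 0.1], k=3)
    print(label, masses)
```

Because every query scans the full training set, the cost grows with the product of the number of queries and the number of training samples, which is the bottleneck a distributed implementation has to address.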