Multidimensional Similarity Join Using MapReduce

Published: 2016, Last Modified: 06 Feb 2025, Venue: WAIM (2) 2016, License: CC BY-SA 4.0
Abstract: Similarity join is arguably one of the most important operators in multidimensional data analysis. However, processing a similarity join is costly, especially for large-volume, high-dimensional data. In this work, we aim to process the similarity join on MapReduce so that the join computation can be scaled horizontally. To balance the workload across all MapReduce nodes, we systematically select the most profitable feature using a novel data selectivity approach. Given the selected feature, we develop a partitioning scheme for MapReduce processing based on two different optimization goals. Our proposed techniques are extensively evaluated on real datasets.
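To make the general idea concrete, the following is a minimal sketch of a single-feature partitioned similarity self-join, simulated in one Python process rather than on a real MapReduce cluster. It is not the method of the paper: the abstract does not give the selectivity measure or the two partitioning objectives, so `select_feature` (which simply picks the dimension with the widest value range), the fixed cell width `eps`, the Euclidean distance, and all function names are assumptions made for illustration only.

```python
from collections import defaultdict
from itertools import combinations
from math import dist, floor


def select_feature(points):
    """Hypothetical stand-in for feature selection: widest value range."""
    dims = len(points[0])
    return max(
        range(dims),
        key=lambda d: max(p[d] for p in points) - min(p[d] for p in points),
    )


def map_phase(points, feature, eps):
    """Emit (cell, record) pairs keyed on the selected feature.

    Each point is 'home' in its own cell of width eps and is replicated
    as a 'guest' to the next cell, so pairs that straddle a cell border
    still meet in exactly one reducer.
    """
    for pid, p in enumerate(points):
        cell = floor(p[feature] / eps)
        yield cell, ("home", pid, p)
        yield cell + 1, ("guest", pid, p)


def reduce_phase(records, eps):
    """Join within one cell: home-home and home-guest pairs only."""
    homes = [(pid, p) for role, pid, p in records if role == "home"]
    guests = [(pid, p) for role, pid, p in records if role == "guest"]
    for (i, p), (j, q) in combinations(homes, 2):
        if dist(p, q) <= eps:
            yield (min(i, j), max(i, j))
    for i, p in homes:
        for j, q in guests:
            if dist(p, q) <= eps:
                yield (min(i, j), max(i, j))


def similarity_join(points, eps):
    """Return sorted index pairs (i, j), i < j, with distance <= eps."""
    feature = select_feature(points)
    groups = defaultdict(list)  # simulated shuffle: group records by cell
    for cell, record in map_phase(points, feature, eps):
        groups[cell].append(record)
    result = set()
    for records in groups.values():
        result.update(reduce_phase(records, eps))
    return sorted(result)


if __name__ == "__main__":
    pts = [(0.0, 0.0), (0.5, 0.0), (1.2, 0.1), (5.0, 5.0), (5.3, 5.2)]
    print(similarity_join(pts, 1.0))  # [(0, 1), (1, 2), (3, 4)]
```

Because two points within distance `eps` differ by at most `eps` on any single feature, they fall in the same or adjacent cells, which is why replicating each point to one neighboring cell is enough in this sketch. Fixed-width cells can still produce skewed reducers on non-uniform data, which is the kind of imbalance the workload-balancing goal in the abstract is concerned with.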