Abstract: Spatial join has become a frequently used yet resource-intensive operation in geospatial applications, driven by the increasing volume and complexity of geospatial data. With Hadoop and Spark becoming the de facto standard platforms for distributed computing, scalable spatial data processing is primarily achieved by partitioning the input space to form parallel units on these platforms. Effective spatial data partitioning is critical for task parallelization and load balancing, but it faces significant challenges due to data skew and the geometric and topological complexity of spatial objects, particularly in supporting spatial joins. This paper examines the interplay among query performance, spatial data partitioning, query types, data characteristics, and system characteristics. We qualitatively and quantitatively analyze the features of representative partitioning algorithms that affect overall query performance. Building on these analyses, we propose a data sampling-based approach for selecting optimized partitioning strategies. We conduct extensive experiments on large and complex datasets using MapReduce frameworks to validate the correctness of our analysis and the effectiveness of our optimization approach.