Strark-H: A Strategy for Spatial Data Storage to Improve Query Efficiency Based on Spark

Weitao Zou, Weipeng Jing, Guangsheng Chen, Yang Lu

Published: 01 Jan 2019, Last Modified: 12 Apr 2024, ICA3PP (1) 2019, Readers: Everyone

Abstract: In this paper, we propose Strark-H, a storage and query strategy for large-scale spatial data based on Spark, which improves the response speed of spatial queries by considering both the spatial location and the category keywords of spatial objects. First, we define a custom InputFormat class so that Spark natively understands the content of Shapefile, a common file format for storing spatial data. Then, we put forward a partitioning and indexing method for spatial storage, in which spatial data is partitioned unevenly according to spatial position; this ensures that the size of each partition does not exceed the HDFS block size and preserves the spatial proximity of spatial objects in the cluster. Moreover, a secondary index is generated, comprising a global index over all partitions based on spatial position and a local index based on the category of spatial objects. Finally, we design a new data loading and query scheme based on Strark-H for spatial queries, including range query, K-NN query, and spatial join query. Extensive experiments on OSM show that Strark-H can be applied to Spark to natively support spatial query and storage with efficiency and scalability.
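
The partitioning and two-level indexing idea described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not specify the partitioning algorithm, so a recursive median split is used here as a stand-in, and the code is plain Python rather than Spark. All names (`split`, `build_index`, `range_query`) and the `capacity` parameter, which stands in for the HDFS block-size limit, are hypothetical.

```python
from collections import defaultdict

# Each spatial object is a tuple (x, y, category). This is an assumed toy model.

def split(points, capacity, depth=0):
    """Halve the point set at the median, alternating axes, until every
    part holds at most `capacity` objects (stand-in for the block limit)."""
    if len(points) <= capacity:
        return [points]
    axis = depth % 2  # 0 splits on x, 1 splits on y
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    return split(pts[:mid], capacity, depth + 1) + split(pts[mid:], capacity, depth + 1)

def bbox(points):
    """Minimum bounding rectangle (xmin, ymin, xmax, ymax) of a point list."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (min(xs), min(ys), max(xs), max(ys))

def build_index(points, capacity):
    """Global index: one bounding box per partition.
    Local index: objects in a partition grouped by category."""
    partitions = []
    for part in split(points, capacity):
        local = defaultdict(list)
        for p in part:
            local[p[2]].append(p)
        partitions.append({"bbox": bbox(part), "local": dict(local)})
    return partitions

def intersects(a, b):
    """True if rectangles a and b overlap (touching edges count)."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def range_query(index, window, category):
    """Return objects of `category` inside `window`, skipping partitions
    whose bounding box does not intersect the window."""
    xmin, ymin, xmax, ymax = window
    hits = []
    for part in index:
        if not intersects(part["bbox"], window):
            continue  # pruned by the global index
        for x, y, cat in part["local"].get(category, []):  # local index lookup
            if xmin <= x <= xmax and ymin <= y <= ymax:
                hits.append((x, y, cat))
    return hits

points = [(0, 0, "cafe"), (1, 1, "school"), (2, 2, "cafe"),
          (9, 9, "cafe"), (8, 8, "school"), (10, 10, "cafe")]
index = build_index(points, capacity=2)
result = range_query(index, (0, 0, 3, 3), "cafe")
```

In this sketch the partitions end up uneven in spatial extent but bounded in size, and a range query touches only partitions whose bounding box overlaps the query window before consulting the per-category lists.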
