Haery: A Hadoop Based Query System on Accumulative and High-Dimensional Data Model for Big Data

Published: 2020, Last Modified: 01 Apr 2026, IEEE Trans. Knowl. Data Eng. 2020, CC BY-SA 4.0
Abstract: Column-oriented stores, known for their scalability and flexibility, are a common NoSQL database implementation and are increasingly used in big data management. In column-oriented stores, a “full-scan” query strategy is inefficient, and the search space can be reduced if data is well partitioned or indexed; however, there is no pre-defined schema for building and maintaining partitions and indexes at low cost. We leverage an accumulative and high-dimensional data model, a sophisticated linearization algorithm, and an efficient query algorithm to solve the challenge of applying a pre-defined and well-partitioned data model to flexible and time-varied key-value data. We adapt a high-dimensional array as the data model to partition the key-value data without additional storage or massive calculation; improve the flexibility of the Z-order linearization algorithm, which maps multidimensional data to one dimension while preserving the locality of the data points; and efficiently build an expansion mechanism for the data model to support time-varied data. The result is Haery, a column-oriented store based on a distributed file system and computing framework. In experiments, Haery is compared with Hive, HBase, Cassandra, MongoDB, PostgresXL, and HyperDex in terms of query performance. The results indicate that Haery on average performs 4.57x, 4.23x, 3.55x, 1.79x, 1.82x, and 120.6x faster, respectively.
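The abstract relies on Z-order linearization, so a minimal sketch of the standard (unmodified) technique may help. It interleaves the bits of each coordinate to produce a single integer key. This is the textbook Morton encoding, not Haery's improved variant, which the abstract does not detail; the function names `z_order_encode` and `z_order_decode` and the `bits_per_dim` parameter are hypothetical and chosen only for illustration.

```python
def z_order_encode(coords, bits_per_dim):
    """Interleave coordinate bits into one Z-order (Morton) index.

    Textbook Z-order only; an illustrative assumption, not Haery's algorithm.
    """
    ndim = len(coords)
    for value in coords:
        if value < 0 or value >> bits_per_dim:
            raise ValueError("coordinate does not fit in bits_per_dim bits")
    code = 0
    for bit in range(bits_per_dim):
        for dim, value in enumerate(coords):
            # Bit `bit` of dimension `dim` lands at position bit * ndim + dim.
            code |= ((value >> bit) & 1) << (bit * ndim + dim)
    return code


def z_order_decode(code, ndim, bits_per_dim):
    """Invert z_order_encode, recovering the original coordinates."""
    coords = [0] * ndim
    for bit in range(bits_per_dim):
        for dim in range(ndim):
            coords[dim] |= ((code >> (bit * ndim + dim)) & 1) << bit
    return tuple(coords)


if __name__ == "__main__":
    key = z_order_encode((3, 5), bits_per_dim=3)
    print(key, z_order_decode(key, ndim=2, bits_per_dim=3))
```

Because the low-order bits of every dimension sit next to each other in the key, points that are close in the multidimensional space tend to receive nearby keys, which is the locality property the abstract refers to.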