Bi-Cluster: A High-Performance Data Query Framework for Large-Scale Scientific Data

Yixian Shen, Cheng Peng, Yunfei Du, Yutong Lu

Published: 2019, Last Modified: 21 Jul 2025. Venue: HPCC/SmartCity/DSS 2019. License: CC BY-SA 4.0

Abstract: Emerging scientific computing on high-performance computer systems generates massive amounts of scientific data, which challenges data management and analysis. State-of-the-art query technology such as FastQuery supports direct indexing of scientific data but does not support in-situ index construction, builds indexes extremely slowly, and generates very large index files. Block index technology retrieves data coarsely, returning a large amount of redundant data and increasing the filtering overhead. This paper therefore proposes Bi-Cluster, a high-performance data query framework for scientific data. For index generation, a two-tier index data structure is designed so that the index can be built in parallel, which reduces the index size and speeds up index generation. In addition, an in-situ kernel index parallel strategy is proposed to build the index for data generated online in real time. For data retrieval, a two-tier parallel query mechanism is designed to read data efficiently, and a dynamic union read strategy and an adaptive scheduling strategy are used to optimize the retrieval process. Finally, the Bi-Cluster framework is evaluated on scientific datasets, and the results show that the design performs well: both the index size and the index generation time are much smaller than those of FastQuery, data retrieval performance improves substantially, and Bi-Cluster scales well in an evaluation on 12288 cores.