One Size Cannot Fit All: A Self-adaptive Dispatcher for Skewed Hash Join in Shared-Nothing RDBMSs

Published: 01 Jan 2024, Last Modified: 13 May 2025, Venue: DASFAA (1) 2024, License: CC BY-SA 4.0
Abstract: Shared-nothing architecture has been widely adopted in RDBMSs so that queries can be processed in parallel and accelerated by scaling the cluster horizontally on demand. In practice, however, skewed data distribution makes load balancing in these RDBMSs a major challenge. In this work, we focus on one representative operator, Hash Join, and investigate how skew among the nodes of a cluster affects load balance and, ultimately, efficiency. We find that existing distributed Hash Join (Dist-HJ) solutions may not provide satisfactory performance when a value is skewed in both the probe and build tables. To address this, we propose a novel Dist-HJ solution, Partition and Replication (PnR). Although PnR provides the best efficiency in some skew scenarios, our exhaustive experiments show that no single Dist-HJ solution wins across all data-skew scenarios. We therefore further propose a self-adaptive Dist-HJ solution with a built-in suboperator cost model that dynamically selects the best Dist-HJ strategy at runtime according to the data skew of the target query. We implement the solution in a commercial shared-nothing RDBMS, CockroachDB, and an empirical study shows that the self-adaptive model achieves the best performance compared with a series of solutions adopted in many existing RDBMSs.
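The idea of picking a join strategy from a cost estimate can be illustrated with a small sketch. It is not the paper's PnR algorithm or its suboperator cost model, neither of which is described in the abstract. It assumes integer join keys, two textbook strategies (hash-partitioning both tables, and broadcasting the build table to every node), and a toy cost equal to the number of rows the busiest node must handle. All function and strategy names are hypothetical.

```python
def hash_partition_loads(keys, num_nodes):
    """Rows per node when each row is routed by key % num_nodes.

    Assumption: keys are integers, so the modulo stands in for a hash.
    """
    loads = [0] * num_nodes
    for k in keys:
        loads[k % num_nodes] += 1
    return loads


def estimate_costs(build_keys, probe_keys, num_nodes):
    """Toy cost per strategy: rows handled by the busiest node (an assumption)."""
    build_loads = hash_partition_loads(build_keys, num_nodes)
    probe_loads = hash_partition_loads(probe_keys, num_nodes)
    # Hash partitioning: each node gets the build and probe rows hashed to it.
    partition_cost = max(b + p for b, p in zip(build_loads, probe_loads))
    # Broadcast build: every node holds the full build table, and the probe
    # rows are assumed to be spread evenly (ceiling division).
    broadcast_cost = len(build_keys) + -(-len(probe_keys) // num_nodes)
    return {"hash_partition": partition_cost, "broadcast_build": broadcast_cost}


def choose_strategy(build_keys, probe_keys, num_nodes):
    """Pick the strategy with the lowest estimated cost."""
    costs = estimate_costs(build_keys, probe_keys, num_nodes)
    return min(costs, key=costs.get)


# Uniform probe keys: hash partitioning balances well.
uniform_choice = choose_strategy(list(range(100)), list(range(100)) * 10, 4)

# One hot probe key (7): hash partitioning overloads a single node.
skewed_choice = choose_strategy(list(range(100)), [7] * 1000 + list(range(100)), 4)
```

Under these assumptions the uniform workload favors hash partitioning, while the workload with one hot probe key favors broadcasting the build table, which is the kind of scenario-dependent outcome that motivates choosing the strategy at runtime.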