Abstract: OpenMLDB is an open-source machine learning database that provides a feature platform for computing consistent features for training and inference. The online interval join (OIJ), i.e., joining two input streams over relative time intervals, is becoming a core operation in OpenMLDB. Its costly nature and intrinsic parallelism opportunities have created significant interest in accelerating OIJ on modern multicore processors. In this work, we first present an in-depth empirical study of an existing parallel OIJ algorithm (Key-OIJ), which applies a key-partitioned parallelization strategy. Key-OIJ has been implemented in Apache Flink and used in real-world applications. However, our study identifies the limitations of Key-OIJ and reveals that it cannot fully exploit modern multicore processors. Based on our analysis, we propose a new approach, the Scale-OIJ algorithm, together with a set of optimization techniques. Compared with Key-OIJ, Scale-OIJ is particularly efficient for workloads involving fewer keys, large time intervals, and large lateness configurations. Extensive experiments using real workloads demonstrate the superior performance of Scale-OIJ. Furthermore, we have partially integrated and tested Scale-OIJ in the latest version of OpenMLDB, demonstrating its practicality in a machine learning database.
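For intuition, the sketch below illustrates the matching predicate of an interval join as described in the abstract: a tuple from one stream is matched with tuples of the other stream whose timestamps fall within a relative time interval around it. The names, the key-equality condition, and the bounds `lower`/`upper` are illustrative assumptions, not the formal operator definition used in the paper.

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class StreamTuple:
    key: str            # join key (assumed; keyed joins as in Key-OIJ)
    ts: int             # event timestamp, e.g., in milliseconds
    payload: Any = None

def interval_match(r: StreamTuple, s: StreamTuple,
                   lower: int, upper: int) -> bool:
    """Return True if s falls within the relative time interval
    [r.ts - lower, r.ts + upper] anchored at r (illustrative sketch)."""
    return r.key == s.key and (r.ts - lower) <= s.ts <= (r.ts + upper)

# Example: with lower=5 and upper=10, a probe tuple at ts=100 matches
# tuples of the other stream with timestamps in [95, 110].
r = StreamTuple(key="user1", ts=100)
s = StreamTuple(key="user1", ts=97)
assert interval_match(r, s, lower=5, upper=10)
```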