A Holistic Stream Partitioning Algorithm for Distributed Stream Processing Systems

Published: 01 Jan 2019, Last Modified: 15 May 2023, PDCAT 2019, Readers: Everyone
Abstract: The performance of modern distributed stream processing systems is critically affected by how load is distributed across workers. Skewed data streams are very common in the real world and pose a great challenge to these systems, especially for stateful applications. Key splitting, which allows a single key to be routed to multiple workers, is an effective way to balance load across the cluster. However, it comes at the cost of increased memory consumption, computation overhead, and network communication. In this paper, we define a new metric that models the cost of key splitting for intra-operator parallelism in stream processing systems, and we offer a novel perspective on reducing the replication factor while keeping both overall load imbalance and processing latency low. As in previous work, our approach treats the head and the tail of the key distribution differently in order to reduce memory requirements. For the head, it uses our proposed notion of regional load imbalance to decide dynamically whether to make one more worker responsible for a heavy hitter. For the tail, it simply uses hash partitioning, which keeps the routing table for the head as small as possible. Extensive experimental evaluation demonstrates that our approach outperforms state-of-the-art partitioning algorithms in terms of load imbalance, replication factor, and latency across different levels of stream skew.
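
The head/tail split described in the abstract can be illustrated with a minimal Python sketch. This is not the paper's algorithm: the abstract does not define regional load imbalance, so the formula used here (mean load of the workers currently serving a key, relative to the global mean load) is a placeholder assumption, as are the class name `HeadTailPartitioner`, the `threshold` parameter, and the premise that the set of heavy hitters is known in advance.

```python
import zlib


class HeadTailPartitioner:
    """Illustrative head/tail partitioner (hypothetical, not the paper's method)."""

    def __init__(self, num_workers, head_keys, threshold=0.2):
        self.num_workers = num_workers
        self.head_keys = set(head_keys)   # assumption: heavy hitters are known
        self.threshold = threshold        # hypothetical imbalance tolerance
        self.load = [0] * num_workers     # messages routed to each worker
        self.routes = {}                  # routing table: head key -> workers

    def _hash(self, key):
        # Deterministic hash so a tail key always maps to the same worker.
        return zlib.crc32(str(key).encode()) % self.num_workers

    def _regional_imbalance(self, region):
        # Placeholder definition (assumption): how far the mean load of the
        # key's current workers sits above the global mean load.
        total = sum(self.load)
        if total == 0:
            return 0.0
        global_mean = total / self.num_workers
        regional_mean = sum(self.load[w] for w in region) / len(region)
        return (regional_mean - global_mean) / global_mean

    def route(self, key):
        if key not in self.head_keys:
            # Tail: plain hash partitioning, no routing-table entry.
            worker = self._hash(key)
        else:
            # Head: start with one worker, add one more only when needed.
            region = self.routes.setdefault(key, [self._hash(key)])
            if (len(region) < self.num_workers
                    and self._regional_imbalance(region) > self.threshold):
                candidates = [w for w in range(self.num_workers)
                              if w not in region]
                region.append(min(candidates, key=lambda w: self.load[w]))
            # Send the message to the least-loaded worker in the region.
            worker = min(region, key=lambda w: self.load[w])
        self.load[worker] += 1
        return worker

    def replication(self, key):
        # Number of workers that may hold state for this key.
        return len(self.routes.get(key, [None]))
```

In this sketch only head keys occupy the routing table, so its size is bounded by the number of heavy hitters, while each tail key keeps a replication of one. A stricter `threshold` trades lower replication for higher imbalance.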