Abstract: Large-scale applications nowadays continuously generate massive amounts of data at high speed. Stream processing engines (SPEs) such as Apache Storm and Flink are becoming increasingly popular because they provide reliable platforms for processing such fast data streams in real time. Despite previous research on the auto-scaling of resources, current SPEs, whether open source such as Apache Storm or commercial such as the streaming components of IBM InfoSphere and Microsoft Azure, lack the ability to automatically grow and shrink to meet the needs of streaming data applications. Moreover, previous research on auto-scaling focuses on techniques that scale resources reactively, which can delay the scaling decision unacceptably for time-sensitive stream applications. To the best of our knowledge, there has been little or no research on using machine learning techniques to proactively predict future bottlenecks based on the data flow characteristics of the stream workload. In this position paper, we present our vision of a three-stage framework for auto-scaling resources for SPEs in the cloud. In the first stage, a workload model is created from data flow characteristics. The second stage uses the output of the workload model to predict future bottlenecks. Finally, the third stage makes the scaling decision for the resources. We begin with a literature review on the auto-scaling of popular SPEs such as Apache Storm.
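To make the three stages concrete, below is a minimal Python sketch of how they could fit together. The class names (WorkloadModel, BottleneckPredictor, ScalingController), the sliding-window moving-average forecast, and the fixed per-worker capacity are illustrative assumptions, not the framework described in the paper; the sketch only shows how the output of each stage feeds the next.

```python
# Illustrative sketch of the three-stage auto-scaling pipeline.
# All names, the moving-average forecast, and the capacity threshold
# are assumptions for exposition, not the paper's actual design.

import math
from collections import deque
from dataclasses import dataclass


@dataclass
class ScalingDecision:
    action: str          # "scale_out", "scale_in", or "none"
    delta_workers: int   # how many workers to add or remove


class WorkloadModel:
    """Stage 1: summarize data flow characteristics of the stream.

    Here the model is simply a sliding window of observed tuple
    arrival rates (tuples/second)."""

    def __init__(self, window: int = 10):
        self.rates = deque(maxlen=window)

    def observe(self, tuples_per_sec: float) -> None:
        self.rates.append(tuples_per_sec)


class BottleneckPredictor:
    """Stage 2: predict the near-future load from the workload model.

    A moving average stands in for the machine-learning model
    envisioned in the paper."""

    def predict_next_rate(self, model: WorkloadModel) -> float:
        return sum(model.rates) / len(model.rates) if model.rates else 0.0


class ScalingController:
    """Stage 3: turn the prediction into a proactive scaling decision."""

    def __init__(self, capacity_per_worker: float, current_workers: int):
        self.capacity_per_worker = capacity_per_worker
        self.current_workers = current_workers

    def decide(self, predicted_rate: float) -> ScalingDecision:
        needed = max(1, math.ceil(predicted_rate / self.capacity_per_worker))
        if needed > self.current_workers:
            return ScalingDecision("scale_out", needed - self.current_workers)
        if needed < self.current_workers:
            return ScalingDecision("scale_in", self.current_workers - needed)
        return ScalingDecision("none", 0)


if __name__ == "__main__":
    model = WorkloadModel()
    for rate in [800, 950, 1200, 1500, 1700]:   # observed tuples/second
        model.observe(rate)

    predicted = BottleneckPredictor().predict_next_rate(model)
    decision = ScalingController(capacity_per_worker=500,
                                 current_workers=2).decide(predicted)
    print(f"predicted rate: {predicted:.0f} tuples/s -> {decision}")
```

Running the sketch on the sample arrival rates predicts a load above the capacity of the two current workers and therefore emits a proactive scale-out decision before the bottleneck materializes, which is the behavior the framework aims for.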