Abstract: General-purpose distributed systems for data processing have become popular in recent years due to high industry demand for big data analytics. However, there is a lack of comprehensive comparisons among these systems and of detailed analyses of their performance, which makes it difficult for users to choose the right system for their applications and hard for system developers to identify which aspects of a system can be improved. In this paper, we conduct an extensive performance study of four state-of-the-art general-purpose distributed computing systems. We evaluate the performance of these systems on three types of workloads that are very common in big data analytics in industry today, namely non-iterative bulk workloads, iterative graph workloads, and iterative machine learning workloads. Through the study, we identify the strengths and limitations of each system. We also test the scalability and analyze the programming complexity of each system. Our results reveal useful insights into the design and implementation of general-purpose distributed computing systems, which can help the development of better systems in the future.