Towards Big Data Analytics across Multiple Clusters

Dongyao Wu, Sherif Sakr, Liming Zhu, Huijun Wu

Published: 2017, Last Modified: 12 Jun 2025CCGrid 2017EveryoneRevisionsBibTeXCC BY-SA 4.0

Abstract: Big data are increasingly collected and stored in a highly distributed infrastructures due to the development of sensor network, cloud computing, IoT and mobile computing among many other emerging technologies. In practice, the majority of existing big-data-processing frameworks (e.g., Hadoop and Spark) are designed based on the single-cluster setup with the assumptions of centralized management and homogeneous connectivity which makes them sub-optimal and sometimes infeasible to apply for scenarios that require implementing data analytics jobs on highly distributed data sets (across racks, data centers or multi-organizations). In order to tackle this challenge, we present HDM-MC, a multi-cluster big data processing framework which is designed to enable the capability of performing large scale data analytics across multi-clusters with minimum extra overhead due to additional scheduling requirements. In this paper, we present the architecture and realization of the system. In addition, we evaluate the performance of our framework in comparison to other state-of-art single cluster big data processing frameworks.