An optimized hierarchical MapReduce framework in supercomputing Internet environment

Published: 2025, Last Modified: 07 Jan 2026. CCF Trans. High Perform. Comput. 2025. License: CC BY-SA 4.0
Abstract: Distributed computing frameworks play a crucial role in supporting compute-intensive applications in the era of big data. The growing demand for computing resources has spurred the interconnection of data centers, leading to the formation of the supercomputing Internet. MapReduce is a popular distributed computing framework designed for large independent clusters. When deployed on the supercomputing Internet, the original MapReduce framework performs inefficiently because of redundant geo-distributed reduce operations. Nonetheless, its abstraction retains significant potential. This paper proposes an enhanced MapReduce framework for the geo-distributed supercomputing Internet that minimizes the need for data transmission across data centers. Leveraging hierarchical scheduling techniques, the framework optimizes data locality to mitigate network latency and bandwidth consumption during reduce operations, thereby reducing overall job execution time. The paper introduces a mathematical model for task scheduling within the supercomputing Internet and formally describes the data transmission process among data centers. In the job scheduling phase, the framework overlaps data transfer with computation by means of pre-selected data centers. In the data transmission phase, the framework aggregates data to reduce the frequency of transmission, alleviating the adverse effects of the hierarchical network architecture on transmission. Comparative analysis with existing methods demonstrates the efficacy of the proposed framework in addressing similar computational challenges. Empirical evaluations underscore the effectiveness of the method in practice.
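To illustrate why aggregating data before it leaves a data center reduces cross-data-center traffic, here is a minimal sketch of a two-level word count. It is not the paper's scheduler or its mathematical model. The function names (`map_words`, `flat_reduce`, `hierarchical_reduce`), the data-center labels, and the "records shipped" cost measure are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def map_words(lines):
    """Map and locally reduce: count words in one data center's input."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def flat_reduce(dc_inputs):
    """Baseline: every (word, 1) pair is shipped to a single global reducer."""
    total = Counter()
    shipped = 0  # assumed cost measure: records crossing data centers
    for lines in dc_inputs.values():
        for line in lines:
            for word in line.split():
                total[word] += 1
                shipped += 1
    return total, shipped


def hierarchical_reduce(dc_inputs):
    """Two-level: reduce inside each data center, then ship one record per key."""
    total = Counter()
    shipped = 0
    for lines in dc_inputs.values():
        local = map_words(lines)   # stays inside the data center
        shipped += len(local)      # only the aggregated keys leave
        total.update(local)        # global reduce over the aggregates
    return total, shipped


# Hypothetical input: two data centers holding a few lines each.
dc_inputs = {
    "dc_a": ["a b a", "b a"],
    "dc_b": ["c a c c"],
}

flat_total, flat_shipped = flat_reduce(dc_inputs)
hier_total, hier_shipped = hierarchical_reduce(dc_inputs)
print(flat_shipped, hier_shipped)
```

Both variants produce the same final counts. In this toy model the hierarchical variant ships one record per distinct key per data center instead of one record per word occurrence. A real deployment would also have to weigh bandwidth, latency, and scheduling, which this sketch ignores.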