Scheduling in MapReduce-like systems for fast completion time

Hyunseok Chang, Murali S. Kodialam, Ramana Rao Kompella, T. V. Lakshman, Myungjin Lee, Sarit Mukherjee

Published: 2011, Last Modified: 06 Nov 2023, INFOCOM 2011, Readers: Everyone

Abstract: Large-scale data processing needs of enterprises today are primarily met with distributed and parallel computing in data centers. MapReduce has emerged as an important programming model for these environments. Since today's data centers run many MapReduce jobs in parallel, it is important to find a good scheduling algorithm that can optimize the completion times of these jobs. While several recent papers have focused on optimizing the scheduler, there exists very little theoretical understanding of the scheduling problem in the context of MapReduce. In this paper, we seek to address this problem by first presenting a simplified abstraction of the MapReduce scheduling problem and then formulating it as an optimization problem. We devise various online and offline algorithms to arrive at a good ordering of jobs that minimizes the overall job completion times. Since optimal solutions are hard to compute (the problem is NP-hard), we propose approximation algorithms that work within a factor of 3 of the optimal. Using simulations, we also compare our online algorithm with standard scheduling strategies such as FIFO and Shortest Job First, and show that our algorithm consistently outperforms these across different job distributions.
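
To illustrate why job ordering affects completion times, the following is a minimal sketch that compares the two baseline strategies named in the abstract, FIFO and Shortest Job First, on a single machine. It is not the paper's algorithm or its MapReduce abstraction: the single-machine model, the function names, and the job durations are all assumptions made for illustration.

```python
from itertools import accumulate


def total_completion_time(durations):
    """Sum of completion times when jobs run back to back on one machine.

    Hypothetical toy model: each job's completion time is the running
    total of the durations of all jobs scheduled up to and including it.
    """
    return sum(accumulate(durations))


def fifo(durations):
    """FIFO baseline: run jobs in arrival order."""
    return list(durations)


def shortest_job_first(durations):
    """Shortest Job First baseline: run the shortest jobs first."""
    return sorted(durations)


# Illustrative job durations (made up for this example), in arrival order.
jobs = [8, 1, 3]

fifo_total = total_completion_time(fifo(jobs))               # 8 + 9 + 12 = 29
sjf_total = total_completion_time(shortest_job_first(jobs))  # 1 + 4 + 12 = 17

print(f"FIFO total completion time: {fifo_total}")
print(f"SJF total completion time:  {sjf_total}")
```

In this toy setting, the same three jobs yield a total completion time of 29 under FIFO and 17 under Shortest Job First, which shows how ordering alone changes the objective. The MapReduce setting studied in the paper is harder than this single-machine case, which is why the authors turn to approximation algorithms.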
