Moving Hadoop into the Cloud with Flexible Slot Management and Speculative Execution

Yanfei Guo, Jia Rao, Changjun Jiang, Xiaobo Zhou

Published: 2017, Last Modified: 17 Apr 2025. IEEE Trans. Parallel Distributed Syst. 2017. CC BY-SA 4.0

Abstract: Load imbalance is a major source of overhead in parallel programs such as MapReduce. Due to the uneven distribution of input data, tasks with more data become stragglers and delay overall job completion. Running Hadoop in a private cloud opens up opportunities for expediting stragglers with more resources but also introduces problems that often outweigh the performance gain: (1) performance interference from co-running jobs may create new stragglers; (2) a semantic gap between Hadoop task management and resource-pool-based virtual cluster management prevents tasks from using resources efficiently. In this paper, we strive to make Hadoop more resilient to data skew and more efficient in cloud environments. We present FlexSlot, a user-transparent task slot management scheme that automatically identifies map stragglers and resizes their slots accordingly to accelerate task execution. FlexSlot adaptively changes the number of slots on each virtual node to balance resource usage so that the pool of resources can be utilized efficiently. FlexSlot further improves the mitigation of data skew with an adaptive speculative execution strategy. Experimental results show that FlexSlot reduces job completion time by up to 47.2 percent compared to stock Hadoop and two recently proposed skew mitigation and speculative execution approaches.