An Efficient Fault Tolerance Strategy for Multi-task MapReduce Models Using Coded Distributed Computing
Abstract: MapReduce is a programming framework for processing and analyzing large volumes of data in a distributed computing environment. It is vulnerable, however, to silent data corruption during task execution, which can yield inaccurate results. Providing fault tolerance in MapReduce while keeping communication overhead low is a considerable challenge. This study presents CDCFT (Coded Distributed Computing Fault Tolerance), a novel approach to fault tolerance within the MapReduce paradigm that combines the strengths of TMR (Triple Modular Redundancy) and CDC (Coded Distributed Computing). CDCFT uses task-level TMR with a voting mechanism to defend against silent data corruption. To reduce communication further, it relays intermediate messages through intra-group broadcasts and pairs a finely tuned node grouping with a strategic data and task allocation procedure. Theoretical analysis establishes that CDCFT’s communication overhead during the Shuffle stage is notably lower than that of traditional CDC methods that rely on triple modular redundancy. Experimental results show that CDCFT substantially reduces overall communication overhead and execution time compared with conventional fault-tolerant methods.
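The task-level TMR and voting idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the CDCFT implementation: the function names `tmr_vote` and `run_with_tmr` are hypothetical, the replicas are simulated as plain Python callables, and the sketch assumes each task output is a hashable value. It covers only majority voting over three replicas and says nothing about the coded shuffle or node grouping.

```python
from collections import Counter


def tmr_vote(replica_outputs):
    """Return the output that at least two of the three replicas agree on."""
    if len(replica_outputs) != 3:
        raise ValueError("TMR expects exactly three replica outputs")
    value, count = Counter(replica_outputs).most_common(1)[0]
    if count < 2:
        # All three replicas disagree, so no majority exists.
        raise RuntimeError("no majority among replica outputs")
    return value


def run_with_tmr(replicas, data):
    """Run the same task on three replicas (hypothetical callables) and vote."""
    outputs = [replica(data) for replica in replicas]
    return tmr_vote(outputs)


def healthy_sum(chunk):
    # A correct map task: sum the values in the chunk.
    return sum(chunk)


def corrupted_sum(chunk):
    # Simulates silent data corruption: a wrong result with no error raised.
    return sum(chunk) + 1


result = run_with_tmr([healthy_sum, corrupted_sum, healthy_sum], [1, 2, 3, 4])
```

In this sketch a single silently corrupted replica is outvoted by the two healthy ones. If all three outputs differ, the vote fails loudly instead of returning a possibly wrong value.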