Abstract: Many graph algorithms are iterative in nature and can be supported by distributed memory-based systems in a synchronous manner. An asynchronous model has recently been proposed to accelerate such iterative computations. However, recovering from failures in an asynchronous system is challenging, since a typical checkpointing-based approach requires many expensive synchronization barriers that largely offset the gains of asynchronous computation. This paper first proposes a fault-tolerant framework that performs recovery by leveraging surviving data rather than checkpoints. Our fault-tolerant approach guarantees the correctness of computations. Additionally, a novel asynchronous checkpointing method is introduced to further boost recovery efficiency at nearly zero overhead. Our solutions are implemented in a prototype system, Faiter, to support failure tolerance for asynchronous computations. Faiter also performs load balancing during recovery by re-assigning lost data to multiple machines. We conduct extensive experiments on a broad spectrum of real-world graphs to show the effectiveness of our proposals.