Abstract: This paper presents a checkpoint and recovery (C&R) protocol that provides fault tolerance for PVM (Parallel Virtual Machine). The protocol helps to mask fail-stop failures from the application. The C&R activities are transparent and require no changes to the PVM library or the operating system. In PVM, an application can change its number of processes during execution. This paper focuses on the problems raised by the dynamic spawning and asynchronous exit of tasks in PVM. The proposed protocol is non-blocking, so it reduces the side effects of checkpointing activities on the original program.