Abstract: Distributed systems are expected to recover correctly from various faults, e.g., node crashes and reboots, and network disconnections and reconnections. However, faults that occur at particular moments can trigger fault recovery bugs, which are rooted in incorrect fault recovery protocols and implementations. Existing random and brute-force fault injection approaches are not effective in revealing fault recovery bugs because of the combinatorial explosion of multiple faults in distributed systems. In this paper, we propose FaultFuzz, a coverage-guided fault injection approach that can systematically and effectively test fault recovery behaviors in distributed systems. Based on runtime feedback collected during distributed system testing, e.g., code coverage and I/O information, FaultFuzz generates possible combinations of faults and preferentially selects the combinations that are more likely to trigger new fault recovery behaviors and reveal new fault recovery bugs. We have applied FaultFuzz to three widely used distributed systems, i.e., ZooKeeper, HDFS, and HBase, and found 5 bugs in them. A video demonstration of FaultFuzz is available at https://youtu.be/SMw1ZF1vyXw.
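As a rough illustration of the coverage-guided selection idea described in the abstract, the following minimal sketch shows one way such a loop could look. It is not FaultFuzz's implementation: the fault encoding, the `fuzz` function, the priority rule, and the `toy_run` stand-in for a real test run are all hypothetical and chosen only to keep the example self-contained.

```python
import heapq
import itertools
from typing import Callable, List, Sequence, Set, Tuple

# Hypothetical encoding: (fault kind, target node, index of the I/O point to inject at).
Fault = Tuple[str, str, int]
FaultSeq = Tuple[Fault, ...]


def toy_run(seq: FaultSeq) -> Set[str]:
    """Stand-in for a real test run: returns the 'branches' a fault sequence covers."""
    covered = {"normal_path"}
    for kind, node, _point in seq:
        covered.add(f"recover_{kind}_{node}")
    # One branch is reachable only by a crash followed by a later reboot of the same node.
    for i, (k1, n1, p1) in enumerate(seq):
        for k2, n2, p2 in seq[i + 1:]:
            if k1 == "crash" and k2 == "reboot" and n1 == n2 and p2 > p1:
                covered.add(f"rejoin_{n1}")
    return covered


def fuzz(
    run: Callable[[FaultSeq], Set[str]],
    kinds: Sequence[str],
    nodes: Sequence[str],
    io_points: Sequence[int],
    budget: int,
    max_len: int = 2,
) -> Tuple[Set[str], List[FaultSeq]]:
    """Explore fault sequences, extending first those whose parent found new coverage."""
    global_cov: Set[str] = set()
    tested: Set[FaultSeq] = set()
    interesting: List[FaultSeq] = []
    tiebreak = itertools.count()  # keeps heap ordering stable for equal priorities
    queue: List[Tuple[int, int, FaultSeq]] = [(0, next(tiebreak), ())]

    while queue and budget > 0:
        _prio, _tb, seq = heapq.heappop(queue)
        if seq in tested:
            continue
        tested.add(seq)
        budget -= 1

        new = run(seq) - global_cov  # feedback: coverage not seen in earlier runs
        global_cov |= new
        if new:
            interesting.append(seq)
        if len(seq) >= max_len:
            continue

        # Mutate by appending one more fault at a later I/O point.
        last_point = seq[-1][2] if seq else -1
        for kind in kinds:
            for node in nodes:
                for point in io_points:
                    if point > last_point:
                        child = seq + ((kind, node, point),)
                        # Lower value pops first, so more new coverage means higher priority.
                        heapq.heappush(queue, (-len(new), next(tiebreak), child))
    return global_cov, interesting
```

A call such as `fuzz(toy_run, ["crash", "reboot"], ["n1", "n2"], [0, 1], budget=50)` returns the accumulated coverage and the fault sequences that contributed new coverage. In a real setting, `run` would deploy the system under test, inject the faults at the recorded I/O points, and collect code coverage from every node.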