Coverage Guided Fault Injection for Cloud Systems

Published: 01 Jan 2023, Last Modified: 13 Nov 2024ICSE 2023EveryoneRevisionsBibTeXCC BY-SA 4.0
Abstract: To support high reliability and availability, modern cloud systems are designed to be resilient to node crashes and reboots. That is, a cloud system should gracefully recover from node crashes/reboots and continue to function. However, node crashes/reboots that occur under special timing can trigger crash recovery bugs that lie in incorrect crash recovery protocols and their implementations. To ensure that a cloud system is free from crash recovery bugs, some fault injection approaches have been proposed to test whether a cloud system can correctly recover from various crash scenarios. These approaches are not effective in exploring the huge crash scenario space without developers' knowledge. In this paper, we propose Crash Fuzz, a fault injection testing approach that can effectively test crash recovery behaviors and reveal crash recovery bugs in cloud systems. CrashFuzz mutates the combinations of possible node crashes and reboots according to runtime feedbacks, and prioritizes the combinations that are prone to increase code coverage and trigger crash recovery bugs for smart exploration. We have implemented CrashFuzz and evaluated it on three popular open-source cloud systems, i.e., ZooKeeper, HDFS and HBase. CrashFuzz has detected 4 unknown bugs and 1 known bug. Compared with other fault injection approaches, CrashFuzz can detect more crash recovery bugs and achieve higher code coverage.
Loading