Abstract: In this paper, we present r-kernel, an operating system kernel foundation designed to improve software reliability in networked embedded systems. The key novelty of r-kernel is that it exploits the time dimension of software execution to improve robustness. Specifically, r-kernel tracks the execution of applications through checkpoints. If an application is determined to have failed, r-kernel rolls back its state to one of the checkpoints created earlier. For the second round of execution, r-kernel provides a safe mode environment to avoid triggering the same bugs. Finally, if the whole system crashes, r-kernel relies on watchdog timers to reset the node, and applies a technique we call past-run trace reconstruction to locate and report the thread that caused the system failure. We have implemented r-kernel on top of the LiteOS operating system kernel running on the popular MicaZ platform. We demonstrate that it achieves these goals with acceptable overhead.