Abstract: Highlights
• Development of FINJ, a novel open-source fault injection tool for HPC systems.
• Release of Antarex, a freely-available dataset of faults in an HPC system.
• A machine learning model for fault detection designed for online monitoring data.
• Faults can be detected online with very high accuracy and low overhead.