Abstract: DRAM failures are among the leading types of failure in large-scale clouds. Previous studies focus on predicting DRAM uncorrectable errors (UEs) and on mitigating the impact of DRAM failures through node-level workload migration and proactive dual in-line memory module (DIMM) replacement. In cloud systems, node migration means migrating all virtual machines (VMs) on a node to other nodes, and such coarse-grained migration consumes considerable resources. Inspired by the observation that DRAM errors tend to cluster in space, this paper proposes Pegasus, the first VM-level solution to mitigate DRAM failures. The key idea of Pegasus is to use the VM, rather than the whole node, as the basic unit for predicting DRAM failures. Specifically, we introduce the new concept of DRAM-caused risky VMs, which cause node unavailability when accessed. We design a novel Error-VM mapping framework and deploy it on a large-scale cloud. Statistical results confirm that DRAM errors are concentrated in the address space managed by a small number of VMs. By combining ECC and spatio-temporal features, our predictor achieves decent performance. Pegasus has been deployed online on over 300,000 nodes in the production cloud. A comparative study shows that the prediction performance of Pegasus is comparable to that of node-level solutions. Meanwhile, our approach reduces costs by over 70% and avoids more than 10% of VM crashes compared to node-level mitigation.
External IDs: dblp:conf/hpca/YongDMW0ZY25