Block-Checksum-Based Fault Tolerance for Matrix Multiplication on Large-Scale Parallel Systems

Yanchao Zhu; Yi Liu; Mingzhen Li; Depei Qian

Published: 01 Jan 2018, Last Modified: 15 May 2025 | HPCC/SmartCity/DSS 2018 | CC BY-SA 4.0

Abstract: As high-performance computers scale up, resilience has become a major challenge. Among the various software-based fault-tolerant approaches, algorithm-based fault tolerance (ABFT) has attractive characteristics for the era of exascale systems, such as high efficiency and low overhead. In particular, because many engineering and scientific applications rely on a few fundamental algorithms, it is possible to provide algorithm-based fault-tolerant mechanisms at a low level and make them application-independent. Previous fault-tolerant mechanisms for matrix computation use row and column checksums, which cannot be directly used on large-scale parallel systems. This paper proposes an algorithm-based fault-tolerant approach for matrix multiplication on large-scale parallel systems. The mechanism uses block checksums, which not only meet the requirements of matrix computations on large-scale parallel systems but also reduce the overhead of fault tolerance compared with traditional schemes based on row and column checksums. In addition, this paper gives a method for choosing the block size to balance accuracy and efficiency. The complexity analysis and examples demonstrate the effectiveness and feasibility of the approach.
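
The abstract does not spell out how the block checksums are built, so the following is only a minimal sketch under stated assumptions, not the authors' scheme. It applies the classic row and column checksum idea to a single tile of the product: one row block of A gets an extra row of column sums, one column block of B gets an extra column of row sums, and the resulting tile of C can be verified and repaired locally. The function names, the per-tile layout, and the single-error assumption are all hypothetical choices made for illustration.

```python
# Illustrative sketch only: per-tile row/column checksums (an assumption,
# not the construction from the paper). Pure Python, integer data.

def matmul(a, b):
    """Plain matrix product for lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def with_column_checksum(a):
    """Append one row holding the sum of each column."""
    return a + [[sum(col) for col in zip(*a)]]

def with_row_checksum(b):
    """Append one column holding the sum of each row."""
    return [row + [sum(row)] for row in b]

def block_product(a_rows, b_cols):
    """Multiply a row block of A by a column block of B.

    The result is the tile of C with an extra checksum row and column.
    """
    return matmul(with_column_checksum(a_rows), with_row_checksum(b_cols))

def locate_error(c_full, tol=1e-9):
    """Return (row, col) of a single corrupted data element, else None."""
    n, m = len(c_full) - 1, len(c_full[0]) - 1
    bad_rows = [i for i in range(n)
                if abs(sum(c_full[i][:m]) - c_full[i][m]) > tol]
    bad_cols = [j for j in range(m)
                if abs(sum(c_full[i][j] for i in range(n)) - c_full[n][j]) > tol]
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        return bad_rows[0], bad_cols[0]
    return None

def correct_error(c_full, pos):
    """Rebuild one data element from its row checksum (assumed intact)."""
    i, j = pos
    m = len(c_full[0]) - 1
    others = sum(c_full[i][k] for k in range(m) if k != j)
    c_full[i][j] = c_full[i][m] - others

# Hypothetical 4x4 inputs, tile size 2: the tile C[0:2, 2:4].
A = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
B = [[1, 0, 2, 1], [0, 1, 1, 3], [2, 1, 0, 1], [1, 2, 1, 0]]
tile = block_product(A[0:2], [row[2:4] for row in B])

tile[1][0] += 7                      # simulate a soft error in one element
position = locate_error(tile)        # intersection of the failing row and column
if position is not None:
    correct_error(tile, position)
```

Because each tile carries its own checksums in this sketch, a process that owns a tile can check it without seeing the rest of the matrix. Smaller tiles localize errors more tightly but add more checksum rows and columns, which is the kind of trade-off the block-size selection mentioned in the abstract would have to balance.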
