On future exascale computing systems, it is expected that incorrect functioning of the hardware (referred to as faults) may become the norm rather than the exception. A particularly insidious case is that of soft faults: transient (temporary) errors that do not cause the program to crash or halt but may nevertheless go undetected. For example, bit flips that occur during a computation, or in cache or memory, are typical soft faults.
For soft faults, we show how to use the idea of self-stabilization, which originates in work by Dijkstra (1974) for problems in distributed control, to make fault-tolerant iterative solvers. At a high level, a self-stabilizing system is one that, starting from an arbitrary state (valid or invalid), reaches a valid state within a finite number of steps. This property imbues the system with a natural means of tolerating soft faults.
We have written an initial paper on our approach [1]. Below, we summarize the paper's key ideas.
Our basic recipe for applying self-stabilization to numerical algorithm design is roughly as follows. First, we regard the execution of a given solver algorithm as the ``system.'' Its state consists of some or all of the variables that would be needed to continue its execution. Then, we identify a condition that must hold at each iteration for the solver to converge. When the condition holds, the solver is in a ``valid'' state; otherwise, it is in an invalid state. Finally, we augment the solver algorithm with an explicit mechanism that will bring it from an invalid state to a valid one within a guaranteed finite number of iterations. Importantly, this mechanism is designed so that the solver need not detect a fault explicitly; rather, it causes the transition from an invalid to a valid state automatically, as the sketch below illustrates.
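As a schematic illustration of this recipe (a minimal sketch, not code from the paper; the names `step`, `restore_state`, and `converged` are hypothetical placeholders), the augmented solver can be organized as a loop in which the correction runs unconditionally and periodically, so no explicit fault detection is required:

```python
def self_stabilizing_solve(state, step, restore_state, converged,
                           correction_period=8, max_iters=10_000):
    """Generic shape of a self-stabilizing solver loop (illustrative).

    step          -- advances the solver state by one iteration
    restore_state -- maps an *arbitrary* state, valid or corrupted,
                     back into a valid one
    converged     -- tests the stopping criterion
    """
    for k in range(max_iters):
        if k % correction_period == 0:
            # The correction is applied unconditionally and periodically,
            # so the solver never needs to detect a fault explicitly: an
            # invalid state is repaired within a bounded number of steps.
            state = restore_state(state)
        if converged(state):
            return state
        state = step(state)
    return state
```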
We have thus far devised two proof-of-concept self-stabilizing iterative linear solvers: one is a self-stabilized variant of steepest descent (SD), and the other is a variant of the conjugate gradient (CG) method. Our self-stabilized versions of SD and CG do require a small amount of fault detection; for example, we may check only for NaNs and infinities. Otherwise, under specific assumptions on the types and rates of transient soft faults, we can show formally that our modified solvers are self-stabilizing.
To get an idea of how this process works, consider the case of CG. Recall that CG constructs A-orthogonal search directions that span a Krylov subspace; in each iteration, it updates the solution estimate by taking a step of optimal size along the current search direction. A fault, when it occurs, can corrupt any part of this state, for instance destroying the A-orthogonality of the search directions on which CG's convergence guarantees rely, so that the method may stagnate or diverge. Our self-stabilized variant therefore periodically restores a valid state, for example by reliably recomputing the residual from the current solution estimate and restarting the search direction.
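A minimal sketch of such a self-stabilized CG, assuming only the periodic correction step runs reliably, might look as follows; the function name `ss_cg`, the parameter `correction_period`, and the reset policy are illustrative choices here, not the paper's exact algorithm:

```python
import numpy as np

def ss_cg(A, b, x0=None, tol=1e-8, correction_period=8,
          max_iters=1000, A_reliable=None):
    """Sketch of a self-stabilized conjugate gradient (illustrative).

    A           -- operator used in ordinary (unreliable) iterations
    A_reliable  -- operator used in the periodic correction step, i.e.,
                   the small fraction of computation assumed reliable;
                   defaults to A
    """
    A_rel = A if A_reliable is None else A_reliable
    x = np.zeros_like(b) if x0 is None else x0.copy()
    r = b - A_rel @ x
    p = r.copy()
    rho = r @ r
    bnorm = np.linalg.norm(b)
    for k in range(1, max_iters + 1):
        if np.sqrt(rho) <= tol * bnorm:
            return x                     # (a production version would
                                         #  verify convergence reliably)
        q = A @ p                        # unreliable matvec
        alpha = rho / (p @ q)
        x = x + alpha * p
        r = r - alpha * q
        rho_new = r @ r
        p = r + (rho_new / rho) * p
        rho = rho_new
        if k % correction_period == 0:
            # Periodic, unconditional correction, assumed reliable. The
            # only explicit fault detection is a NaN/Inf check on x.
            if not np.all(np.isfinite(x)):
                x = np.zeros_like(b)     # illustrative reset policy
            r = b - A_rel @ x            # recompute the true residual
            p = r.copy()                 # restart the search direction
            rho = r @ r
    return x
```

Because the correction recomputes the residual and search direction from x alone, any corruption of r, p, or the scalars is flushed out within one correction period; a corrupted x is either repaired through subsequent iterations or caught by the NaN/Inf check.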
To illustrate the fault-tolerance properties of self-stabilized algorithms, we test our self-stabilized CG experimentally, analyzing its convergence and overhead under different types and rates of faults in a selective reliability mode, in which we assume that a small fraction of the computation can be performed in guaranteed reliable mode. We also present a theoretical analysis relating the amount of reliable computation required to the rate of faults and to the problem being solved.
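One way to exercise such an experiment (a hypothetical harness, not the paper's setup) is to inject random bit flips into the unreliable matvec while the correction step of the `ss_cg` sketch above uses the pristine operator:

```python
import numpy as np

rng = np.random.default_rng(42)

def flip_random_bit(v):
    """Transient soft fault: flip one random bit of one random float64
    entry of v, in place (illustrative fault model)."""
    i = int(rng.integers(v.size))
    v.view(np.uint64)[i] ^= np.uint64(1) << np.uint64(int(rng.integers(64)))

class UnreliableMatvec:
    """Wraps a matrix so each product is corrupted with probability
    `fault_rate`; the reliable correction step bypasses this wrapper."""
    def __init__(self, A, fault_rate):
        self.A, self.fault_rate = A, fault_rate
    def __matmul__(self, p):
        q = np.array(self.A @ p)          # fresh copy, safe to corrupt
        if rng.random() < self.fault_rate:
            flip_random_bit(q)
        return q

# A small SPD test system, solved under a 5% per-matvec fault rate:
n = 100
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)               # symmetric positive definite
b = rng.standard_normal(n)
x = ss_cg(UnreliableMatvec(A, 0.05), b, A_reliable=A, max_iters=5000)
print("relative residual:", np.linalg.norm(b - A @ x) / np.linalg.norm(b))
```

Sweeping `fault_rate` and `correction_period` in such a harness is one way to probe, empirically, the trade-off between the amount of reliable computation and the fault rate that the analysis addresses.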
Beyond the specific findings for self-stabilized CG, we believe self-stabilization holds promise as a useful tool for constructing resilient solvers more generally.