
James Elliott
Making numerical algorithms tolerate silent data corruption

Sandia National Laboratories
P O Box 5800
Albuquerque
NM 87185-1320
jjellio3@ncsu.edu
Mark Hoemmen
Frank Mueller

Future extreme-scale computer systems may expose silent data corruption (SDC) to applications, in order to save energy or increase performance. However, resilience research struggles to come up with useful abstract programming models for reasoning about SDC in numerical algorithms. Existing work randomly flips bits in running iterative linear solvers, but this only shows average-case behavior for a low-level, artificial hardware model. Algorithm developers need predictions of worst-case behavior, especially if they plan to use their solvers for simulations that support making high-consequence decisions. Also, since we know so little about how SDC may manifest in future hardware, we think it premature to draw conclusions using a fault model that may have nothing to do with how future computers behave.

We argue instead for a numerical unreliability fault model, where SDC manifests as unbounded perturbations to floating-point data. This has minimal dependence on details of a hardware implementation, and puts SDC in terms that numerical analysts can understand. We apply this model to design iterative linear solvers that can tolerate such perturbations. These solvers depend on a few techniques that we consider generally applicable to all kinds of numerical algorithms. For example, they use inexpensive "sanity" checks that bound or exclude error in the results of computations. Given a selective reliability programming model that requires reliability only when and where needed, such checks can make algorithms reliable despite unbounded faults. Sanity checks, and in general a healthy skepticism about the correctness of subroutines, are wise even if hardware is perfectly reliable.
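
To make the idea of an inexpensive sanity check concrete, the following is a minimal sketch and not necessarily the check used in the solvers described here. It assumes the computation under suspicion is a matrix-vector product y = A x and uses the standard inequality ||A x||_2 <= ||A||_F ||x||_2 to reject results that are non-finite or too large to be correct. All function names are hypothetical and chosen for illustration only.

```python
import math

def frobenius_norm(A):
    """Frobenius norm of a dense matrix stored as a list of rows."""
    return math.sqrt(sum(a * a for row in A for a in row))

def two_norm(x):
    """Euclidean norm of a vector stored as a list."""
    return math.sqrt(sum(v * v for v in x))

def matvec(A, x):
    """Dense matrix-vector product; stands in for an unreliable kernel."""
    return [sum(a * v for a, v in zip(row, x)) for row in A]

def passes_sanity_check(A, x, y, slack=1e-12):
    """Return True if y is a plausible value of A x.

    Rejects NaN/Inf entries and any y whose norm exceeds the bound
    ||A||_F * ||x||_2 (with a small relative slack for rounding).
    The check bounds the error; it cannot detect small perturbations.
    """
    if not all(math.isfinite(v) for v in y):
        return False
    bound = frobenius_norm(A) * two_norm(x)
    return two_norm(y) <= bound * (1.0 + slack)

# Illustrative data (assumed, not from the abstract).
A = [[2.0, 1.0], [1.0, 3.0]]
x = [1.0, -1.0]
y = matvec(A, x)

# Simulate SDC as an unbounded perturbation of one entry.
corrupted = list(y)
corrupted[0] = 1.0e30

print(passes_sanity_check(A, x, y))
print(passes_sanity_check(A, x, corrupted))
```

A check of this kind costs about one extra norm computation per product if the norm of A is computed once and stored reliably. It only excludes large errors, so under this sketch the surrounding algorithm would still have to tolerate perturbations that stay within the bound.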

In general, we present a case for a radically different research methodology that merges numerical analysis with systems fault tolerance, and provides algorithm developers with programming models they can use to ensure correctness despite SDC. We specifically solicit feedback from this community, because this challenge requires researchers who are comfortable bridging mathematics and computer science.

This work was supported partly by the RX-Solvers grant from the Advanced Scientific Computing Research program of the U.S. Department of Energy's (DOE) Office of Science, and partly by the Consortium for Advanced Simulation of Light Water Reactors under U.S. DOE Contract No. DE-AC05-00OR22725. Sandia National Laboratories is a multiprogram laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. DOE's National Nuclear Security Administration under Contract DE-AC04-94AL85000.




Copper Mountain 2014-02-23