Jon Calhoun
Towards a More Fault Resilient Multigrid Solver

201 N Goodwin Ave
Urbana
IL 61801
jccalho2@illinois.edu
Luke Olson
Marc Snir

Much is known about the stability, convergence rates, complexity, and efficiency of linear solvers, but little is known about their ability to handle bit-flips that can lead to silent data corruptions (SDCs). As supercomputers continue to add more cores to increase the performance of the machine, they are becoming more susceptible to SDCs. Going forward, it is paramount that studies be conducted on the impact of SDCs on widely used algorithms and applications. This paper examines the linear solver algebraic multigrid (AMG) in an environment where bit-flips are possible. We propose an algorithm-based detection and recovery scheme that preserves the numerical properties of AMG while maintaining near-perfect convergence rates in faulty environments.





Copper Mountain 2014-02-23