This work focuses on resilience techniques at extreme scale that deal with fail-stop and silent errors simultaneously. We present a unified framework and optimal algorithmic solutions to cope with both error sources. Silent errors are handled via verification mechanisms (either partially or fully accurate) and in-memory checkpoints. Fail-stop errors are handled via disk checkpoints. All verification and checkpoint types are combined into computational patterns. We provide a unified model and a full characterization of the optimal pattern. Our results extend several previously published solutions.
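To make the notion of a computational pattern concrete, here is a minimal sketch of how the verification and checkpoint types could be ordered. The layout is an assumption made for illustration, not the optimal pattern characterized in this work: each segment ends with a fully accurate verification followed by an in-memory checkpoint, partially accurate verifications separate the chunks inside a segment, and a disk checkpoint closes the pattern. The function name and operation labels are hypothetical.

```python
from typing import List


def build_pattern(segments: int, chunks_per_segment: int) -> List[str]:
    """Return the ordered operations of one illustrative pattern.

    Assumed layout (hypothetical, for illustration only):
      - every chunk of work is followed by a verification;
      - the last verification of a segment is fully accurate, the others partial;
      - every segment ends with an in-memory checkpoint;
      - the pattern ends with a disk checkpoint.
    """
    if segments < 1 or chunks_per_segment < 1:
        raise ValueError("segments and chunks_per_segment must be >= 1")

    ops: List[str] = []
    for _ in range(segments):
        for chunk in range(chunks_per_segment):
            ops.append("work")
            is_last_chunk = chunk == chunks_per_segment - 1
            # Only a fully accurate verification makes it safe to checkpoint.
            ops.append("guaranteed_verification" if is_last_chunk
                       else "partial_verification")
        ops.append("memory_checkpoint")
    ops.append("disk_checkpoint")
    return ops


if __name__ == "__main__":
    for op in build_pattern(segments=2, chunks_per_segment=2):
        print(op)
```

In this sketch a silent error caught by a verification would trigger a rollback to the last in-memory checkpoint, while a fail-stop error would trigger a rollback to the last disk checkpoint.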
We instantiate the model for sparse iterative solvers and discuss several application-specific error detection and correction mechanisms, including partial recomputations, orthogonality checks and algorithm-based fault tolerance (ABFT).
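As an illustration of the ABFT idea, the sketch below applies a standard column-checksum test to a sparse matrix-vector product y = Ax: since the sum of the entries of y equals (1^T A) x, a mismatch signals a silent error in the product. This is a generic textbook check, not necessarily the mechanism used in this work; the dictionary-based matrix format, the function names and the tolerance are assumptions.

```python
import math
from typing import Dict, List, Tuple

# Hypothetical sparse format: {(row, col): value}
SparseMatrix = Dict[Tuple[int, int], float]


def column_checksums(a: SparseMatrix, n_cols: int) -> List[float]:
    """Compute c = 1^T A, the sum of each column of A (done once, offline)."""
    c = [0.0] * n_cols
    for (_, j), value in a.items():
        c[j] += value
    return c


def spmv(a: SparseMatrix, x: List[float], n_rows: int) -> List[float]:
    """Compute the sparse matrix-vector product y = A x."""
    y = [0.0] * n_rows
    for (i, j), value in a.items():
        y[i] += value * x[j]
    return y


def checksum_ok(y: List[float], c: List[float], x: List[float],
                rel_tol: float = 1e-9) -> bool:
    """Check that sum(y) matches c . x up to a floating-point tolerance."""
    expected = sum(cj * xj for cj, xj in zip(c, x))
    return math.isclose(sum(y), expected, rel_tol=rel_tol, abs_tol=1e-12)


if __name__ == "__main__":
    a = {(0, 0): 2.0, (0, 1): 1.0, (1, 1): 3.0}
    x = [1.0, 2.0]
    c = column_checksums(a, n_cols=2)
    y = spmv(a, x, n_rows=2)
    print(checksum_ok(y, c, x))   # error-free product
    y[0] += 1.0                   # inject a silent error into the result
    print(checksum_ok(y, c, x))   # corrupted product
```

A single checksum of this kind only detects errors; it cannot locate or correct them, and errors that cancel in the sum go unnoticed, which makes it a partially accurate verification in the sense used above.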
Joint work with Anne Benoit, Aurelien Cavelan, Massimiliano Fasi, Julien Langou, Hongyang Sun and Bora Ucar.