
AmgX: Algebraic Multigrid and Preconditioned Iterative Methods on GPUs

Maxim Naumov, Marat Arsaev, Patrice Castonguay, Jonathan Cohen,
Julien Demouth, Joe Eaton, Simon Layton, Nikolay Markovskiy,
Nikolai Sakharnykh, Robert Strzodka, Zhenhai Zhu

NVIDIA, 2701 San Tomas Expressway, Santa Clara, CA 95050
mnaumov@nvidia.com

The solution of large sparse linear systems arises in many applications, such as computational fluid dynamics and oil reservoir simulation. In realistic cases the matrices are often so large that solving them requires large-scale distributed parallel platforms.

In this talk we discuss the AmgX library, which encapsulates distributed algebraic multigrid and preconditioned iterative methods that take advantage of multiple GPUs. AmgX is designed to fully utilize the large degree of parallelism within a single GPU, as well as to scale to large numbers of GPUs connected via in-node buses and inter-node networks. AmgX fully parallelizes both the setup and solve phases required to implement complex iterative multilevel solvers, and attempts to increase performance without significantly weakening numerical convergence. We focus on the parallel algorithms, their mapping to massively parallel architectures such as GPUs, and the impact of several recent CUDA features, such as Unified Virtual Addressing (UVA), the Multi-Process Service (MPS), and stream priorities.
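As a concrete illustration of the setup/solve split, a minimal single-GPU host driver using the AmgX C API might look roughly as follows. The entry points match the AmgX headers as later published; the configuration string is purely illustrative, not a tuned setup.

    #include <amgx_c.h>

    int main(void)
    {
        /* One-time library initialization. */
        AMGX_initialize();

        /* Illustrative configuration: PCG preconditioned by AMG. */
        AMGX_config_handle cfg;
        AMGX_config_create(&cfg, "config_version=2, solver=PCG, preconditioner=AMG");

        AMGX_resources_handle rsrc;
        AMGX_resources_create_simple(&rsrc, cfg);

        /* dDDI = device matrix, double matrix/vector data, int indices. */
        AMGX_matrix_handle A;  AMGX_matrix_create(&A, rsrc, AMGX_mode_dDDI);
        AMGX_vector_handle x;  AMGX_vector_create(&x, rsrc, AMGX_mode_dDDI);
        AMGX_vector_handle b;  AMGX_vector_create(&b, rsrc, AMGX_mode_dDDI);

        /* ... upload A, b, and an initial guess x here, using
           AMGX_matrix_upload_all() and AMGX_vector_upload() ... */

        AMGX_solver_handle solver;
        AMGX_solver_create(&solver, rsrc, AMGX_mode_dDDI, cfg);

        AMGX_solver_setup(solver, A);    /* setup phase: build the AMG hierarchy */
        AMGX_solver_solve(solver, b, x); /* solve phase: run the iterative method */

        AMGX_solver_destroy(solver);
        AMGX_vector_destroy(b);
        AMGX_vector_destroy(x);
        AMGX_matrix_destroy(A);
        AMGX_resources_destroy(rsrc);
        AMGX_config_destroy(cfg);
        AMGX_finalize();
        return 0;
    }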

The AmgX library implements both classical and aggregation-based algebraic multigrid methods with different selector and interpolation strategies. The library also contains many of the standard and flexible preconditioned Krylov subspace iterative methods, which can be used standalone or as outer solvers around the algebraic multigrid. A variety of smoothers and preconditioners, including block-Jacobi, Gauss-Seidel, and incomplete-LU factorization, have also been developed. The parallelism in the aggregation scheme exploits parallel graph matching techniques, while the smoothers and preconditioners often rely on parallel graph coloring algorithms. A highlight of the library is the full configurability of deep solver hierarchies, in which the outer solver uses other solvers as preconditioners, which can themselves be preconditioned by further user-configured solvers; a sketch of such a configuration follows.
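The sketch below expresses one such nested hierarchy in the scoped configuration-string format of the published AmgX config system: an FGMRES outer solver preconditioned by aggregation AMG, whose smoother is in turn a multicolor DILU sweep. The concrete keys are taken from the open-source AmgX documentation and should be treated as assumptions for the 2014-era version.

    /* FGMRES (outer) -> AMG preconditioner -> multicolor DILU smoother.
       Scope names in parentheses ("main", "amg") are user-chosen labels
       referenced as "scope:parameter". */
    const char *cfg_str =
        "config_version=2, "
        "solver(main)=FGMRES, "
        "main:max_iters=100, "
        "main:preconditioner(amg)=AMG, "
        "amg:algorithm=AGGREGATION, "
        "amg:selector=SIZE_2, "
        "amg:smoother=MULTICOLOR_DILU, "
        "amg:max_levels=10";

    AMGX_config_handle cfg;
    AMGX_config_create(&cfg, cfg_str);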

The AmgX library takes advantage of multiple GPUs and can handle very large sparse linear systems that fit into the aggregate memory of all GPUs present in the system. It has its own memory management system, which allows it to avoid additional synchronization points during the computation. Moreover, it implements a thread manager that may take advantage of Hyper-Q and CUDA stream priorities to offload different tasks to the GPU. Thread and stream priorities are used to ensure that tasks on the critical path are always executed first, which is important for achieving good performance.
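Stream priorities themselves come from the standard CUDA runtime; the following minimal sketch shows how critical-path work can be pinned to a high-priority stream. It illustrates the CUDA feature only, not AmgX's internal thread manager, whose interface is not public.

    #include <cuda_runtime.h>

    void create_priority_streams(cudaStream_t *critical, cudaStream_t *background)
    {
        /* Lower numeric values mean higher priority; "greatest" is the
           highest priority the device supports. */
        int least, greatest;
        cudaDeviceGetStreamPriorityRange(&least, &greatest);

        /* Critical-path tasks (e.g., the coarse-grid visit) go on the
           high-priority stream; latency-tolerant tasks on the other. */
        cudaStreamCreateWithPriority(critical,   cudaStreamNonBlocking, greatest);
        cudaStreamCreateWithPriority(background, cudaStreamNonBlocking, least);
    }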

In a distributed environment, the AmgX library requires the matrix to be partitioned with a graph partitioner, and it uses techniques that rely on rings of nearest neighbors to keep track of communication. Only the required halo elements are communicated across different nodes, and the latency of these transfers is hidden by overlapping communication and computation whenever possible. Moreover, if the problem becomes too small to fill multiple GPUs with work, the smaller problems are consolidated onto fewer GPUs, which again allows the library to minimize communication costs while fully exploiting the computational resources at hand. Wherever possible, AmgX also takes advantage of advanced features such as CUDA IPC and GPUDirect to accelerate inter-GPU communication.
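The overlap pattern itself is standard. A hedged sketch of one way to hide halo-transfer latency for a single neighbor is below; pack_halo, interior_spmv, and boundary_spmv are hypothetical stand-ins for AmgX internals, not the real API.

    #include <mpi.h>
    #include <cuda_runtime.h>

    /* Hypothetical stand-ins for AmgX-internal kernels. */
    static void pack_halo(double *send_buf, cudaStream_t s)           { (void)send_buf; (void)s; }
    static void interior_spmv(cudaStream_t s)                         { (void)s; }
    static void boundary_spmv(const double *recv_buf, cudaStream_t s) { (void)recv_buf; (void)s; }

    void halo_exchange_overlap(int neighbor, int n_halo,
                               double *send_buf, double *recv_buf,
                               cudaStream_t comm, cudaStream_t compute)
    {
        MPI_Request reqs[2];

        /* Pack boundary values destined for the neighbor; with a
           CUDA-aware MPI (e.g., via GPUDirect) the buffers can be
           device pointers, otherwise stage through pinned host memory. */
        pack_halo(send_buf, comm);
        cudaStreamSynchronize(comm);

        MPI_Irecv(recv_buf, n_halo, MPI_DOUBLE, neighbor, 0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Isend(send_buf, n_halo, MPI_DOUBLE, neighbor, 0, MPI_COMM_WORLD, &reqs[1]);

        interior_spmv(compute);            /* overlaps with the network transfer */

        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
        boundary_spmv(recv_buf, compute);  /* needs the received halo values */
    }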

The algebraic multigrid algorithm implemented in the AmgX library achieves a 2-4x speedup on a single GPU over a competitive commercial CPU implementation, and it scales well across multiple nodes while sustaining this performance advantage. The AmgX library has been integrated into ANSYS Fluent 15.0, where it has been shown to reduce total simulation time by about 2x for coupled incompressible unsteady flow calculations while delivering the same results as the CPU version.

Finally, results from large-scale numerical experiments will also be presented.




Copper Mountain 2014-02-23