View: session overviewtalk overview
08:00 | Reducing Communication Costs in Sparse Matrix-Vector Multiplication SPEAKER: Amanda Bienz ABSTRACT. The sparse matrix-vector multiply (SpMV) is a key component in the solve phase of algebraic multigrid (AMG). The coarse levels of AMG hierarchies often require a large number of small messages to be communicated during each SpMV, yielding large communication costs on levels near the middle of the hierarchy. Furthermore, as linear systems are solved on increasingly large parallel systems, communication among processes dominates the cost of each SpMV. Topology-aware methods, which can map an MPI rank to a physical node, can be used to reduce the costs associated with communication. Values can be redistributed among all processes local to a node before being communicated across the network, reducing the inter-node communication requirements at the cost of additional, less-costly, intra-node communication. |
08:25 | OPTIMIZING COMMUNICATIONS IN MULTI-GRID METHODS USING GPUDIRECT ASYNC SPEAKER: Elena Agostini ABSTRACT. HPGMG is a benchmark for HPC geometric multi-grid methods. A multi-grid solver works on a grid having a fixed total size but different resolutions during the execution so that the amount of computation and communication varies significantly within the same application. NVIDIA developed a CUDA version of HPGMG-FV (Finite Volume) where, according to a threshold, finer (higher) levels are computed on the GPU whereas coarser (lower) levels on the CPU: that hybrid scheme exploits at their best the strengths of both platforms. During the HPGMG-FV execution, there are three main communication phases: boundaries exchange (among a variable number of cluster nodes), interpolation (lower to higher level) and restriction (higher to lower level). By default communications use the Message Passing Interface (MPI). GPUDirect Async is a new technology introduced by NVIDIA in CUDA 8.0 to support mutual direct control between the GPU and the interconnect device, a Mellanox Infiniband HCA in our case. In this paper, we propose two optimized communication schemes involving the GPUDirect Async technology, where the burden of the communication is off-loaded onto the GPU itself. When running a properly modified version of HPGMG-FV, where Async technology is used for only those aforementioned communication phases, the communication overhead reduces up to 26%. |
08:50 | Studying Convergence of Nested Iteration with Range Decomposition SPEAKER: Wayne Mitchell ABSTRACT. This paper studies a low-communication algorithm for solving elliptic partial differential equations on high-performance machines, the nested iteration with range decomposition aglorithm (NIRD). Important features of the NIRD algorithm as well as new improvements over previous implementations are discussed. A convergence proof is outlined, and numerical results are presented for a variety of elliptic problems which include different difficulties such as anisotropy and strong advection. These results show NIRD achieving error within a small factor of that achieved by traditional algorithms but with far less communication. |
09:15 | Multigrid on heterogeneous systems SPEAKER: Lukas Spies ABSTRACT. In recent years, large heterogeneous systems have grown in importance for solving all kinds of problems. While most implementations make use of parallelisation only on a CPU-level, our approach is to develop an efficient hybrid version, incorporating both multiple CPUs and multiple GPUs simultaneously. This allows us to make use of all resources available in modern supercomputers. Overcoming slow communication between the host (CPU) and the device (GPU) is the key to achieve increased performance. In this talk, we discuss various designs as to how a domain can be partitioned for both CPUs and GPUs, and their benefits and shortcomings. Following this, we'll pick one of the designs and look at some performance benchmarks, comparing our hybrid to a CPUs-only version. Finally, we'll take a look at potential benefits of this approach to multigrid. |
09:40 | Finite Element Based Multigrid at Scale SPEAKER: Ulrich Ruede ABSTRACT. Multigrid methods promise to solve PDE at the cost of a few work units, where a work unit is defined as the cost of executing a relaxation of the discrete system. Following the classical notion of Brandt, textbook efficiency is achieved when the number of work units for computing a solution remains below ten. It has been shown that textbook efficiency can be achieved for many essential problems, such as, e.g., scalar elliptic PDE or Stokes-like saddle point systems. Assuming in turn that a work unit costs about 100 Floating Point operations, the ideal cost for solving a discretized PDE is below 1000 Floating point operations per degree of freedom. Given that a supercomputer may perform at a PetaFlops, i.e.~can execute 10^15 floating point operations per second, we expect to solve well-behaved PDE at a rate of a trillion (i.e. 10^12) degrees of freedom per second. A ''tiny'' problem, with say a billion (10^9) unknowns should then be solved in a millisecond. This idealistic prediction cannot be reached. However, it is troublesome to realize that many existing solvers do not only underperform by a moderate factor, but that they fail to reach this performance by factors of three to six orders of magnitude. Such a discrepancy between theoretical knowledge and computational practice cannot be explained by the arguments that appear in the literature, such as communication overhead, the need to optimize the code, or by blaming it on unfavorable memory-hierarchy effects. The reasons for our failure to predict performance from the fundamental mathematical properties of the algorithm are deeper and are caused by fundamental problems in how we design, analyze, implement, and test numerical methods. The talk will try to shed some light on this situation. As a test bed we will use the Hierarchical Hybrid Grids (HHG) multigrid prototyping software that can solve up to $10^{13}$ degrees of freedom for problems. |
10:25 | A Performance Model for Allocating the Parallelism in a Multigrid-in-Time Solver SPEAKER: Ulrike Yang ABSTRACT. The traditional way to numerically solve time dependent problems is to sequentially march through time, solving for one time step and then the next. The parallelism in this approach is limited to the spatial dimension, which is quickly exhausted, causing the gain from using more processors to solve a problem to diminish. One approach to overcome this barrier is to use methods that are parallel in time. These methods have the potential to achieve dramatically better performance compared to time-stepping approaches, but achieving this performance requires carefully choosing the amount of parallelism devoted to space versus the amount devoted to time. Here, we present a performance model that, for a multigrid-in-time solver, makes the decision on when to switch to parallel-in-time and on how much parallelism to devote to space vs. time. In our experiments, the model selects the best parallel configuration in most of our test cases and a configuration close to the best one in all other cases. |
10:50 | The Role of Performance Modeling for Numerical Algorithms SPEAKER: Martin Schulz ABSTRACT. Performance models play an important role in the design and use of numerical algorithms. They can be used to understand design tradeoffs, identify performance problems and optimization opportunities, help map algorithms to suitable hardware architectures, and tune algorithms at runtime. In this talk I will first discuss these opportunities in modeling numerical algorithms and present some of the challenges in creating these models. I will then highlight some of the successes of applying modeling techniques to Algebraic Multigrid methods and will show how they helped improve the performance of the AMG solver in the widely used hypre library. |
11:15 | Understanding Complexity of Evolving High Performance Computing Systems to Predict Performance SPEAKER: Kirk Jordan ABSTRACT. High performance computing (HPC) is a tool frequently used to understand complex problems in numerous areas such as aerospace, biology, climate modeling and energy. However, the evolution of HPC systems with multicore processors and various accelerators, these systems are growing in complexity as the community strived toward Exascale performance. Predicting performance for key applications and kernels would assist in the development of these systems. This talk will highlight the efforts of Hormozd Gahvari to develop such performance predictions for Algebraic Multigrid, a method identify to scale well. We will show some results obtain from Hormozd's work that showed how to better understand future systems. As the system become more complex as I will describe, Hormozd’s formulations can have impact at key decision points in the development of new HPC systems. |
11:40 | Is the Pingpong Communication Model Still Relevant? SPEAKER: William Gropp ABSTRACT. The pingpong test, which measures the time to send a message between two processes, is often used to define a communication performance model for message-passing parallel programs. In this talk, we show that this test is no longer representative, and propose a simple modification that provides a better, more predictive communication performance model. |
12:05 | Preparing Robust Structured Multigrid for Exascale SPEAKER: Andrew Reisner ABSTRACT. Coarse-grid problems present a common challenge to the parallel scalability of structured multilevel solvers. As problems grow coarser, parallel communication costs dominate computation. To balance these costs, we look at incrementally redistributing coarse-grid problems to a subset of involved processors. A predictive performance model is then used to guide this redistribution. In this talk, we present the results of applying this approach to improve the parallel scalability of Black Box Multigrid (BoxMG), a robust multigrid method for discretizations of PDEs on logically structured grids. |
16:30 | Distance Laplacians and Algebraic Multigrid SPEAKER: Raymond Tuminaro ABSTRACT. This talk considers situations where it is may be problematic to directly apply algebraic multigrid (AMG) to the matrix system that one is interested in solving. In these cases, one can instead consider applying AMG to a distance Laplacian matrix in order to generate grid transfer operators within the multigrid hierarchy. The resulting hierarchy is then post-processed by taking the original matrix of interest and projecting it on all levels using the grid transfers obtained with the distance Laplacian. This distance Laplacian matrix, L, has the same sparsity pattern as the original matrix for scalar PDE systems. Off diagonal entries, L_{ij} correspond to the negative reciprocal of the distance between the i and j mesh nodes while diagonal entries are chosen so that the sum of the entries within each matrix row is zero. Two examples will be given where a distance Laplacian strategy is adapted to a specific context. The first example corresponds to single phase porous media flow (reservoir modeling). In this case, the underlying mesh is highly stretched and the equations include large permeability variations. Additionally, the discretization scheme is such that there are multiple degrees-of-freedom (dofs) at some grid nodes (coming from opposite sides of a crack). Depending on the permeability of the crack, these opposite-side dofs might be strongly coupled or weakly coupled. The second example comes from a multiphase flow formulation that allows for discontinuous pressures at interfaces. In this case, a distance Laplacian/AMG approach is used to avoid complications associated with applying AMG to system where the the number of PDE equations per node varies. In both cases, adaption of the distance Laplacian idea is necessary to successfully address difficulties associated with the specifics of each application. |
16:55 | Reduction-based Algebraic Multigrid for Upwind Discretizations SPEAKER: Ben Southworth ABSTRACT. Algebraic multigrid (AMG) is a powerful solver for certain classes of large, sparse linear systems, Ax = b, often arising from the discretization of PDEs. However, convergence theory and most variations of AMG rely on A being symmetric positive definite. Here, a new reduction-based AMG method is developed for upwind discretizations of hyperbolic PDEs. Theory is presented motivating the reduction approach, and connecting it in a big-picture sense to other non-symmetric AMG methods. Two variations are introduced, a scalar algorithm, as well as a block-wise approach targeting higher-order discretizations. Numerical results prove reduction-based AMG to be a highly effective solver for upwind and discontinuous discretizations of advection- and transport-type equations. |
17:20 | Robust AMG interpolation with target vectors SPEAKER: Ruipeng Li ABSTRACT. In this talk, we will present an approach to compute interpolation coefficients in algebraic multigrid (AMG) that can integrate additional test vectors that span the near null space of the global matrix. The presented approach is based on the framework of AMGe, which computes the interpolation coefficients by solving local quadratic minimization problems. The local operators are computed algebraically by defining local extension mappings as done in the element-free AMGe work. So, this work is a natural extension to element-free AMGe to be able to handle an arbitrary number of test vectors. The connections between the direct and indirect interpolation schemes within the proposed approach and the connections to the standard AMG methods will be discussed. With a proper coarse grid, numerical results suggested that compared with standard AMG, this approach can be more robust and efficient for the applications that can provide null-space information such as the linear elasticity problems. |
17:45 | Optimal Interpolation in Algebraic Multigrid Methods SPEAKER: Karsten Kahl ABSTRACT. Algebraic multigrid methods have proven to be reliable and efficient methods for the solution of large sparse linear systems of equations in many different application areas and their theoretical foundation has been studied extensively. Though, there remains a mismatch between practical implementations and theory. On one hand many algebraic multigrid constructions successfully used in practice do not allow for a precise theoretical analysis. And on the other hand there is abstract theory, which at least in the two-grid case, gives exact predictions of the convergence rate, but does not hint at the proper construction of the constituents of the method. In this talk we try to bridge this gap between theory and practice. In this talk we derive, based on the exact two-grid theory, an optimal interpolation operator for two-grid algebraic multigrid methods. We show that it matches the exact convergence rate of the theory and is in agreement with the limiting cases of the theory, namely when the two-grid method becomes an exact solve. As a side-product we obtain a relation between the convergence rate of the optimal two-grid method and the dimension of the coarse-grid space. Equipped with the information on its dimension, we develop a method that locates representatives of the coarse-grid degrees of freedom that allow for a sparse approximation of the optimal interpolation operator. This analysis thus indicates if a practical method, which prescribes the sparsity of interpolation can achieve a near-optimal convergence rate. While the optimal interpolation is an interesting theoretical tool, which reveals a lot of information about the problem and the constituents of the two-grid method, its direct practical use is limited. Thus we finish the talk by presenting ideas on how to integrate the concept of optimal interpolation into the Bootstrap algebraic multigrid framework and present numerical results for a variety of applications. |
18:10 | Performance Analysis of SA-AMG Solver with Additional Near-Kernel Vectors at Coarser Levels SPEAKER: Naoya Nomura ABSTRACT. The smoothed aggregation algebraic multigrid (SA-AMG) method is among the fastest solvers for large-scale linear equations. The SA-AMG method achieves good convergence by generating small-sized matrices from the original matrix problem. However, the convergence of the method can be further improved by setting near-kernel vectors. There are related works on how to set near-kernel vectors adequately, including adaptive smoothed aggregation multigrid. Almost all of the related works uses the same number of near-kernel vectors at all levels. In our research, we evaluate the method of adding additional multiple near-kernel vectors at coarser levels. We have implemented the solver that extracts near-kernel vectors and add them at the fine level and one coarser level. The solver can use different number of near kernel vectors at each level. Especially when the coarser levels' smoother convergence property influence the convergence of the whole solver, this method has possibility to improve the convergence using additional near-kernel vectors at coarser levels. This research investigates the balance between the increasing cost of the coarser levels and the improvement of the convergence by using multiple near-kernel vectors at the coarser levels. |