Recent advances in parallel computing hardware and parallel algorithms have made it feasible for Quantum Density Functional Theory (QDFT) codes, such as ONETEP and CASTEP, to accurately compute the properties of large molecules or materials interacting with solvents [1]. The solvent contribution to the electronic charge density can be incorporated in QDFT computations by solving either the Poisson Equation (PE) or the Poisson-Boltzmann Equation (PBE) in a domain which contains the modelled quantum system and is characterised by an inhomogeneous electric permittivity. DL_MG, a hybrid parallel (MPI+OpenMP) multigrid solver, was developed as an efficient and robust method for computing the required electrostatic potentials [2].
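For reference, the electrostatic problems being solved take the following standard forms (a sketch in Gaussian units with illustrative notation; the exact conventions used in DL_MG may differ):

\nabla \cdot [ \epsilon(\mathbf{r}) \nabla \phi(\mathbf{r}) ] = -4\pi \rho(\mathbf{r})   (PE)

\nabla \cdot [ \epsilon(\mathbf{r}) \nabla \phi(\mathbf{r}) ] + 4\pi \sum_i q_i c_i \, \lambda(\mathbf{r}) \exp\!\big(-q_i \phi(\mathbf{r}) / k_B T\big) = -4\pi \rho(\mathbf{r})   (PBE)

where \epsilon(\mathbf{r}) is the inhomogeneous permittivity, \phi(\mathbf{r}) the sought electrostatic potential, \rho(\mathbf{r}) the charge density of the quantum system, c_i and q_i the bulk concentration and charge of ion species i, and \lambda(\mathbf{r}) an ion-accessibility function.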
Performance optimisation of memory- or communication-bound applications, such as multigrid solvers, requires continual re-evaluation of data placement and concurrency as the number of cores per compute node increases and memory hierarchies become more complex. These aspects become even more important if accelerators (GPUs or Xeon Phi) are used.
On multicore nodes, the standard approach used by many numerical applications to reduce communication overheads is to employ OpenMP for parallelism at node level and MPI for inter-node communication. However, scaling OpenMP towards 30 threads on a non-uniform memory access (NUMA) node proves difficult in many applications. Recently, the MPI-3 standard introduced the ability to use shared memory for MPI ranks on the same node, reducing MPI data traffic while preserving the more structured communication environment offered by MPI [3], as sketched in the example below.
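As a minimal illustration of the MPI-3 mechanism (a generic sketch, not code taken from DL_MG), ranks residing on the same node can be grouped with MPI_Comm_split_type and can then access a window allocated with MPI_Win_allocate_shared through direct loads and stores instead of intra-node messages; the slab size and variable names below are hypothetical:

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Group the ranks that share physical memory, i.e. those on one node. */
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);

    int node_rank, node_size;
    MPI_Comm_rank(node_comm, &node_rank);
    MPI_Comm_size(node_comm, &node_size);

    /* Each rank contributes one slab of a node-local grid to a shared
       window (the slab size here is purely illustrative). */
    const MPI_Aint n = 1024;
    double *slab;
    MPI_Win win;
    MPI_Win_allocate_shared(n * sizeof(double), sizeof(double),
                            MPI_INFO_NULL, node_comm, &slab, &win);

    /* Obtain a direct pointer into the next rank's slab; loads and stores
       through it replace intra-node message passing, with synchronisation
       provided by, e.g., MPI_Win_fence on the window. */
    if (node_rank + 1 < node_size) {
        MPI_Aint size;
        int disp_unit;
        double *neighbour;
        MPI_Win_shared_query(win, node_rank + 1, &size, &disp_unit,
                             &neighbour);
    }

    MPI_Win_free(&win);
    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}

In a halo-exchange setting this replaces intra-node MPI_Send/MPI_Recv pairs with direct memory access, while inter-node boundaries continue to use ordinary message passing.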
In this paper we study the strong scaling of the DL_MG multigrid solver across several hardware platforms. We present a comparative performance analysis of hybrid OpenMP-MPI parallelism versus MPI parallelism enhanced with the shared-memory facilities introduced by the MPI-3 standard, and we compare the performance of multicore CPU nodes with that obtained on GPUs.
References:
1. J. Dziedzic, S. J. Fox, T. Fox, C. S. Tautermann and C.-K. Skylaris, Large-scale DFT calculations in implicit solvent - a case study on the T4 lysozyme L99A/M102Q protein, Int. J. Quantum Chem. 113(6) (2013)
2. http://www.hector.ac.uk/cse/distributedcse/reports/onetep/onetep.pdf
3. T. Hoefler, J. Dinan, D. Buntinas, P. Balaji, B. Barrett, R. Brightwell, W. Gropp, V. Kale and R. Thakur, MPI + MPI: a new hybrid approach to parallel programming with MPI plus shared memory, Computing 95 (2013)