A recent proof-of-principle study proposes an energy- and charge-conserving, fully implicit particle-in-cell (PIC) algorithm for multi-scale, full-f kinetic, electrostatic simulations in one dimension [1]. The method employs a Jacobian-free Newton-Krylov (JFNK) solver and is capable of taking very large time steps without loss of numerical stability or accuracy. This is in contrast with the explicit time integration approach (leap-frog) employed by classical PIC algorithms, which feature stringent temporal and spatial resolution constraints to avoid numerical instabilities [2]. By enforcing nonlinear convergence of the field-particle equations and ensuring energy and charge conservation to numerical round-off, the fully implicit PIC algorithm improves on earlier implicit PIC approaches (e.g., implicit moment [3,4] and direct implicit [5,6] methods), which failed to converge the nonlinear residual and as a result developed large numerical errors in long-time simulations [7], with significant self-heating/cooling often observed.
A fundamental feature of the method is the nonlinear elimination of particle equations, which employs the enslavement of particle orbits to the field equations. As a result of particle enslavement, it segregates particle orbit integrations from the field solver, while remaining fully self-consistent. This provides great flexibility, and dramatically improves the solver efficiency by decreasing the number of degrees of freedom of the associated nonlinear system. However, it requires a particle push per nonlinear residual evaluation, which makes the particle push the most time-consuming operation in the algorithm. This renders the implicit algorithm ideally suited for emerging heterogeneous architectures which combine CPUs and GPUs [8]. Thus, the field JFNK solver can remain on the CPU, while the particle mover (which is naturally data-parallel and compute-bounded) stays on the GPU. Although the adaptivity of the particle mover creates load imbalances and dynamic control flows (which pose a challenge for the GPU), we demonstrate that a highly efficient GPU implementation of the implicit particle mover (using CUDA) is possible [8]. We obtain 300 to 400 GOp/s (counting floating-point, integer and special function operations) using single precision, depending on the electrostatic PIC model of choice (Ampere vs. Poisson). This is about 20(Nvidia GeForce GTX580), and about 200 to 300 times faster than a single-CPU (Intel Xeon@3.16GHz) serial implementation.
We have implemented an efficient mixed-precision, hybrid CPU-GPU implementation of the implicit PIC algorithm with a defect-correction approach [8]. In this version of the hybrid algorithm, we keep the JFNK solver on the CPU in double precision, while the particle mover is implemented on the GPU in both double precision (to evaluate the Jacobian right hand side in Newton's method) and single precision (to evaluate the Jacobian-free matrix-vector products). As a result, many single precision particle-push evaluations are used to accelerate convergence to a double-precision result. In a challenging long-timescale ion acoustic wave simulation, the mixed-precision hybrid CPU-GPU solver is shown to over-perform the double-precision CPU-only serial version by a factor of &sim#sim;100, without apparent loss of robustness or accuracy.
[1] G. Chen, L. Chacón, D. C. Barnes, J. Comput. Phys. 230, (2011). [2] C. K. Birdsall, A. B. Langdon, Plasma Physics via Computer Simulation, McGraw-Hill, New York, 2004. [3] R. J. Mason, J. Comput. Phys., 41 (2) (1981) 233–244. [4] J. U. Brackbill, D. W. Forslund, Multiple time scales, Academic Press, 1985. [5] A. Friedman, A. B. Langdon, B. I. Cohen, Plasma Phys. Controlled Fusion, 6 (6) (1981) 225 – 36. [6] A. B. Langdon, D. C. Barnes, Multiple time scales, Academic Press, 1985. [7] B. I. Cohen, A. B. Langdon, D. W. Hewett, R. J. Procassini, J. Comput. Phys., 81 (1) (1989) 151–168. [8] G. Chen, L. Chacón, D. C. Barnes, J. Comput. Phys. submitted (2011).