The fast multipole method (FMM) is an efficient algorithm for ``N-body problems'': it approximates the pairwise interactions among N particles in roughly O(N) work, rather than the O(N^2) work of direct summation. I will present a new scalable algorithm and a new implementation of the kernel-independent fast multipole method that combine distributed-memory parallelism (via MPI) with shared-memory/SIMD parallelism (via GPU acceleration). I will conclude by discussing the direct numerical simulation of blood flow in the Stokes regime using the FMM, and I will describe simulations with 200 million red blood cells, an improvement of four orders of magnitude over previous results.