Enhancing the Cache Performance of Multigrid Codes on Structured Grids

Markus Kowarschik

System Simulation Group
Department of Computer Science
University of Erlangen-Nuremberg, Germany
markus.kowarschik@cs.fau.de

Modern computer architectures use hierarchical memory designs in order to hide the latency of accesses to main memory, which is dramatically slow compared with the floating-point performance of the CPUs. Current memory designs commonly involve several levels of cache memories, which can be accessed up to a hundred times faster than main memory. It is well known that efficient program execution can only be achieved if the code respects the hierarchical memory design. Unfortunately, today's compilers are still far from automatically performing code transformations like the ones we apply in order to achieve remarkable speedups. As a consequence, much of this optimization effort is left to the programmer.

Another observation is that iterative methods, such as multigrid, are characterized by successive sweeps over data sets which, for representative problems in science and engineering, are much too large to fit in cache. Therefore, straightforward implementations of such algorithms often reach disappointing execution speeds that fall far below the theoretically available peak performance of the machines under consideration.

In this paper we present techniques to enhance the cache efficiency of multigrid methods for variable-coefficient problems on regular mesh structures. These techniques comprise both data layout optimizations and data access optimizations. Data layout optimizations are code transformations that change the arrangement of the data in memory; examples are array padding and the introduction of cache-aware data structures. In contrast, data access optimizations modify the order in which the data are referenced in the course of the computation. Loop blocking (tiling), for example, belongs to this class.
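As an illustration of the two classes of transformations, the following minimal C sketch combines both in a single relaxation kernel: the padded array declaration is a data layout optimization, and the tiled loop nest is a data access optimization. This is not code from the paper; the grid size N, the padding width PAD, and the tile size B are hypothetical values chosen for illustration, and in practice they would be tuned to the cache parameters of the target machine.

```c
#include <assert.h>

#define N   512   /* grid points per dimension (a power of two)          */
#define PAD 1     /* hypothetical padding width, breaks cache conflicts  */
#define B   64    /* hypothetical tile (block) size for loop blocking    */

/*
 * Data layout optimization -- array padding:
 * allocating rows of length N + PAD instead of N shifts successive
 * rows in memory, so that vertically adjacent grid points no longer
 * map to the same cache set when N is a power of two.
 */
static double u[N][N + PAD];   /* approximate solution          */
static double f[N][N + PAD];   /* h^2 times the right-hand side */

/*
 * Data access optimization -- loop blocking (tiling):
 * one Gauss-Seidel relaxation sweep over the interior points,
 * with the i/j loops tiled into B x B blocks so that each block
 * of the grid is reused while it still resides in cache.
 */
static void blocked_sweep(void) {
    for (int ii = 1; ii < N - 1; ii += B) {
        int imax = (ii + B < N - 1) ? ii + B : N - 1;
        for (int jj = 1; jj < N - 1; jj += B) {
            int jmax = (jj + B < N - 1) ? jj + B : N - 1;
            for (int i = ii; i < imax; ++i)
                for (int j = jj; j < jmax; ++j)
                    u[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j]
                                    + u[i][j - 1] + u[i][j + 1]
                                    + f[i][j]);
        }
    }
}
```

Note that a plain Gauss-Seidel sweep can only be tiled like this within a single sweep; fusing several successive sweeps into one cache-resident pass requires the more elaborate blocking strategies discussed in the paper.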

We provide a variety of performance results, showing both profiling data and the speedups that can be achieved by applying these optimization techniques. In most cases, speedup factors of 2 to 3 are obtained.

For further details, we refer the reader to the web page of our DiME project (Data-local iterative methods for the efficient solution of partial differential equations).