A matrix $A$ can be reduced to upper triangular banded form
by BLAS-3 block Householder transformations. Deferring matrix updates, the
algorithm accesses $A$ only to extract blocks and to perform the
multiplications $AV$ and $U^T A$, where if $U$ and $V$ have $k$ columns then
the bandwidth of the upper triangular $B$ is $k+1$. When $A$ ($m \times n$) is sparse,
block Householder eliminations provide a BLAS-3 method to construct a $U B V^T$
approximation, with $U$ and $V$ orthogonal, $U$ ($m \times j$), $V$ ($n \times j$),
and $B$ ($j \times j$) banded upper triangular, with the size of $B$
constrained by available RAM.
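As a hedged illustration (the WY representation is a standard way to apply block Householder transformations; the symbols $W$, $Y$, $\widehat{W}$, $\widehat{Y}$ are notation assumed here, not taken from the source), a left transformation $Q = I - W Y^T$ and a right transformation $P = I - \widehat{W} \widehat{Y}^T$ update $A$ as
\[
Q^T A = A - Y\,(W^T A), \qquad A\,P = A - (A\,\widehat{W})\,\widehat{Y}^T,
\]
so $A$ is referenced only through products with $k$-column dense blocks.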
The decomposition is stable, is efficient in terms of cache utilization, and should scale well in distributed parallel computation.
The $U B V^T$ approximate (truncated) decomposition can be used to
provide some matrix singular values, to solve linear systems and least
squares problems, and to provide an approximate inverse preconditioner.
Multigrid-style applications include parallel iterations with the full matrix,
use of $U B V^T$ as a preconditioner, and solution of a coarsened problem. Each of these
applications is discussed.
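As a one-line illustration (a sketch in the notation above, not a formula from the source): because $U$ and $V$ have orthonormal columns, the singular values of the small banded $B$ approximate singular values of $A$, and a least squares problem reduces to a small banded one,
\[
\min_x \|A x - b\|_2 \;\approx\; \min_y \|B\,y - U^T b\|_2, \qquad x = V\,y .
\]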
A primary advantage of the block reduction to a banded upper triangular form is that the underlying sparse matrix is accessed only for multiplications by blocks of dense columns (sparse matrix dense matrix, or SPMD, multiplications). Serial SPMD multiplications run at a significant fraction of peak speed on cache-based processors, and should also run well in parallel. The stability of block Householder transformations aids scalability.
Numerically, the truncated $U B V^T$ decomposition is seen to be
particularly efficient in approximating low-rank matrices, or low-rank
matrices added to a matrix with a known factorization.
Some unresolved questions are how best to prepermute $A$, how best to pad
the starting block (thick start), and, considering the reduction as a Krylov
method, the best restart strategies (thick restart? repermutation of $A$?).
The remainder of the abstract discusses why the multiplication $AV$ is
likely to be faster than computing $Av$, assuming $A$ is
sparse, $v$ is a dense vector, and $V$ is a dense matrix with relatively few
columns. Assume $A$ is too large to fit in cache memory.
$A$ is typically stored so that it can be
pulled in a stream from RAM.
If $v$ fits in cache, then as each element of $A$ appears in the CPU it
can be matched by the appropriate element of $v$. If indexing operations
do not take too long, the main cost is then the fetch of $A$ from RAM.
Since the fetch of $A$ is streamed, elements of $A$ arrive more or less
at the peak speed of the data bus.
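The pattern can be made concrete with a minimal compressed sparse row (CSR) sketch; the format and names here are illustrative assumptions, not taken from the source. The arrays holding $A$ stream from RAM once, while $v$ is accessed indirectly through the stored column indices:
\begin{verbatim}
/* y = A*v, A in CSR format.  val[] and col[] stream from RAM;
   v[] is accessed through col[] and is fast only if it stays in cache. */
void spmv_csr(int m, const int *rowptr, const int *col,
              const double *val, const double *v, double *y)
{
    for (int i = 0; i < m; i++) {
        double sum = 0.0;
        for (int p = rowptr[i]; p < rowptr[i + 1]; p++)
            sum += val[p] * v[col[p]];  /* one multiply per element of A */
        y[i] = sum;
    }
}
\end{verbatim}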
On Intel Xeons, sparse $Av$ can attain up to about ten per cent of the
advertised peak flop rate. When $v$ becomes too large to fit in cache,
and if $v$ is accessed randomly, then the $Av$
computation is dominated by cache misses and slows dramatically. In some
experiments with Intel Xeons, $Av$ computed at less than one per cent of
peak processor speed. When multiplying by $V$ with $k$ columns as opposed
to $v$, each access of an element of $A$ allows $k$ multiplications.
Storage of $V$ should be arranged so that when a given element of $V$ is
accessed, the next required elements of $V$ are adjacent in memory. In Fortran,
$V$ for the product $AV$ is stored as $V^T$, so that each row of $V$ is sequentially
stored as a column of $V^T$.
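A hedged sketch of the resulting BLAS-3 kernel (again with illustrative names; row-major C storage of $V^T$ plays the role of Fortran column-major storage of $V$): each streamed element of $A$ now drives $k$ multiply-adds against a contiguous row of $V$.
\begin{verbatim}
/* Y = A*V, A in CSR.  V is stored transposed: vt[j*k + c] = V(j,c),
   so the k values used with each val[p] are contiguous in memory. */
void spmm_csr(int m, int k, const int *rowptr, const int *col,
              const double *val, const double *vt, double *yt)
{
    for (int i = 0; i < m; i++) {
        double *yrow = &yt[(size_t)i * k];   /* row i of Y, contiguous */
        for (int c = 0; c < k; c++) yrow[c] = 0.0;
        for (int p = rowptr[i]; p < rowptr[i + 1]; p++) {
            const double a = val[p];
            const double *vrow = &vt[(size_t)col[p] * k];
            for (int c = 0; c < k; c++)      /* k multiplications per a */
                yrow[c] += a * vrow[c];
        }
    }
}
\end{verbatim}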
In experiments summarized here, we also blocked $V$ (into column blocks) to
further minimize cache misses; a sketch follows. The blocked SPMD product $AV$ can execute at a
flop rate several orders of magnitude faster than the non-blocked $Av$. The
marked speedup of BLAS 3 over BLAS 2 motivates the algorithm development
described in the presentation.
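As a final hedged sketch (an assumed blocking scheme, not necessarily the one used in the experiments): when $k$ is large, $V$ can be processed in column blocks of width \verb|kb|, chosen so that the rows of $V$ touched in one pass are more likely to stay cache-resident, at the cost of streaming $A$ once per block.
\begin{verbatim}
/* Y = A*V by column blocks of width kb <= k.  vt and yt are
   row-major with k columns, as in spmm_csr above. */
void spmm_csr_blocked(int m, int k, int kb, const int *rowptr,
                      const int *col, const double *val,
                      const double *vt, double *yt)
{
    for (int c0 = 0; c0 < k; c0 += kb) {
        int w = (c0 + kb < k) ? kb : k - c0;  /* width of this block */
        for (int i = 0; i < m; i++) {
            double *yrow = &yt[(size_t)i * k + c0];
            for (int c = 0; c < w; c++) yrow[c] = 0.0;
            for (int p = rowptr[i]; p < rowptr[i + 1]; p++) {
                const double a = val[p];
                const double *vrow = &vt[(size_t)col[p] * k + c0];
                for (int c = 0; c < w; c++)
                    yrow[c] += a * vrow[c];
            }
        }
    }
}
\end{verbatim}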