The original TPABLO algorithms are a collection of algorithms which compute a symmetric permutation of a linear system such that the permuted system has a relatively full block diagonal with relatively large nonzero entries. This block diagonal can then be used as a preconditioner. We propose and analyze three extensions of this approach: we incorporate a nonsymmetric permutation to obtain a large diagonal, we use a more general parametrization for TPABLO, and we use a block Gauss-Seidel preconditioner which can be implemented to have the same execution time as the corresponding block Jacobi preconditioner. Since our approach allows for efficient use of level 3 BLAS operations, it outperforms direct solvers and rivals standard ILU preconditioners on many test problems on a single processor system, while having good potential for efficient parallelization.