Volume 23, Issue 4 e202300228
RESEARCH ARTICLE
Open Access

Diagonally-Addressed Matrix Nicknack: How to improve SpMV performance

Jens Saak
Max Planck Institute for Dynamics of Complex Technical Systems, Magdeburg, Germany

Jonas Schulze (Corresponding Author)
Max Planck Institute for Dynamics of Complex Technical Systems, Magdeburg, Germany

Correspondence
Jonas Schulze, Max Planck Institute for Dynamics of Complex Technical Systems, 39106 Magdeburg, Germany.
Email: [email protected]
First published: 20 November 2023

Abstract

We suggest a technique to reduce the storage size of sparse matrices at no loss of information. We call this technique Diagonally-Addressed (DA) storage. It exploits the typically low matrix bandwidth of matrices arising in applications. For memory-bound algorithms, this traffic reduction has direct benefits for both uni-precision and multi-precision algorithms. In particular, we demonstrate how to apply DA storage to the Compressed Sparse Rows (CSR) format and compare the performance in computing the Sparse Matrix Vector (SpMV) product, which is a basic building block of many iterative algorithms. We investigate 1367 matrices from the SuiteSparse Matrix Collection fitting into the CSR format using signed 32 bit indices. More than 95% of these matrices fit into the DA-CSR format using 16 bit column indices, potentially after Reverse Cuthill-McKee (RCM) reordering. Using IEEE 754 $\mathtt{double}$ precision scalars, we observe a performance uplift of 11% (single-threaded) or 17.5% (multithreaded) on average when the traffic exceeds the size of the last-level CPU cache. The predicted uplift in this scenario is 20%. For traffic within the CPU's combined level 2 and level 3 caches, the multithreaded performance uplift is over 40% for a few test matrices.

1 INTRODUCTION

An operation is called memory-bound if its performance is limited by the memory bandwidth [Byte/s] of the executing hardware. In that context, it holds that
$$\begin{equation} \frac{P_1}{P_2} = \frac{\mathrm{traffic}_2}{\mathrm{traffic}_1}, \end{equation}$$ (1)
where $P_i$ denotes performance [FLOP/s], and $\mathrm{traffic}_i$ [Byte] accounts for all the memory involved. Hence, reducing the traffic should directly lead to a performance improvement. It is often assumed that computing the Sparse Matrix Vector (SpMV) product is memory-bound; see, for example, [1-3]. The biggest contributor to the overall SpMV traffic is the matrix. Therefore, in the following, we present a technique that reduces the storage size of a sparse matrix at no loss of information.

The key ingredient of the new storage technique is the observation that many matrices arising in, for example, finite element simulations have a very low matrix bandwidth (under certain permutations). That means, potentially after permutation, all non-zero matrix entries are located close to the matrix diagonal. This motivates storing the indices of these entries relative to the matrix diagonal rather than as absolute positions, which we call Diagonally-Addressed (DA) storage. Due to the small matrix bandwidth, the relative indices may be stored in a smaller (integer) data type. This technique is easily applicable to many sparse formats, for example, Compressed Sparse Rows (CSR), Block CSR (BSR), or (one of the vectors of) Coordinate (COO) storage. In this paper, we apply DA storage to the CSR format, obtaining the Diagonally-Addressed CSR (DA-CSR) format, and compare the SpMV performance against our implementation of CSR as well as the Intel Math Kernel Library (MKL) [4].

DA storage differs from Diagonal (DIA) storage in that the new technique still requires one index per non-zero (depending on the underlying format), instead of one index per diagonal. Also, it does not impose a diagonal-major order of the entries, nor require a potentially padded, dense storage for each of the diagonals. The Recursive Sparse Blocks (RSB) format [5] is another data structure for sparse matrices motivated by a cache-efficient and parallel implementation of the SpMV product. It divides a sparse matrix into a tree structure of sparse blocks, whose leaves are iterated in a Z- or Morton-order. The leaf blocks are stored in the COO, CSR, or Compressed Sparse Columns (CSC) format.

RSB was designed for arbitrary sparse matrices, in particular, matrices without an inherent low bandwidth (under certain permutations). RSB allows 16 bit indices as well, but only for its leaf matrices. Meanwhile, DA-CSR has a conceptually simpler non-recursive design, allowing 16 bit indices throughout, which leads to a much lower overhead in terms of Byte per non-zero. Therefore, DA storage does not directly compete with the RSB format, but could be used in the leaf blocks within the RSB format, to allow for an even smaller index type.

The remainder of this paper is structured as follows. Section 2 applies DA storage to the CSR format. Section 3 describes the selection of test matrices. Section 4 describes how to compute the SpMV product using that new DA-CSR format, and measures the performance of SpMV. We conclude the paper in Section 5.

2 DIAGONALLY-ADDRESSED STORAGE

The CSR storage of a matrix $A \in \mathbb{F}^{\mathrm{nrows} \times \mathrm{ncols}}$ comprises three vectors, compare Listing 1, where nnz denotes the number of non-zero entries, oindex_t and iindex_t are integer data types, and scalar_t is an approximation of $\mathbb{F}$, for example, IEEE 754 $\mathtt{double}$ or $\mathtt{float}$ for $\mathbb{F} = \mathbb{R}$. The "row pointers" stored in $\mathtt{rowptr}$ and "column indices" stored in $\mathtt{colids}$ do the bookkeeping imposed by only storing non-zero matrix entries.

Listing 1: (DA-)CSR storage of a matrix.

    struct {
        oindex_t rowptr[nrows+1];
        iindex_t colids[nnz];
        scalar_t values[nnz];
    };

The rth entry $0 \le \texttt{rowptr}[r] < \texttt{nnz}$ is the index into $\texttt{colids}$ and $\texttt{values}$ corresponding to the first non-zero of row r, $0 \le r < \mathrm{nrows}$. The ith entry $\texttt{colids}[i]$ is the column index and $\texttt{values}[i]$ is the value of the ith non-zero, $0 \le i < \mathrm{nnz}$. Let w denote the matrix bandwidth of $A = (a_{rc})$, i.e. the farthest distance of a non-zero matrix entry from the matrix diagonal,
$$\begin{equation} w := \max \lbrace |c-r| : a_{rc} \ne 0 \rbrace \ll \mathrm{ncols}. \end{equation}$$ (2)
For CSR storage, $\texttt{colids}[i] = c_i$, which lies in the range $[0, \mathrm{ncols})$. For DA-CSR storage, $\texttt{colids}[i] = c_i - r_i$, which instead lies in the range $[-w, w]$. We illustrate this transformation in the following example.
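Converting a matrix from CSR to DA-CSR touches only colids; rowptr and values carry over unchanged. The following is a minimal sketch of this transformation (our naming, not the exact library code), narrowing 32 bit absolute indices to signed 16 bit offsets; the assert documents the precondition $w < 2^{15}$.

    #include <cassert>
    #include <cstdint>

    // Sketch: rewrite absolute CSR column indices as offsets to the diagonal,
    // narrowing from 32 bit to 16 bit. Assumes the matrix bandwidth w
    // satisfies w < 2^15, so every offset c - r fits into a signed 16 bit
    // integer.
    void csr_to_da(const int32_t *rowptr, const int32_t *colids,
                   int16_t *da_colids, int32_t nrows) {
        for (int32_t r = 0; r < nrows; ++r) {
            for (int32_t i = rowptr[r]; i < rowptr[r + 1]; ++i) {
                int32_t off = colids[i] - r; // lies in [-w, w]
                assert(-32768 <= off && off < 32768);
                da_colids[i] = (int16_t)off;
            }
        }
    }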

Example 2.1. The sample matrix shown in Figure 1 has $\mathrm{nrows} = \mathrm{ncols} = 4$, $\mathrm{nnz} = 5$, and $w = 1$. Its CSR representation is given by
[Equation (3), rendered as an image in the original: the rowptr, colids, and values vectors of the sample matrix in CSR storage.]
while the DA storage replaces colids to become
[Equation (4), rendered as an image in the original: the same colids vector stored as offsets to the diagonal.]
Observe that colids covers its full range in either storage scheme: $[0, \mathrm{ncols} - 1]$ for CSR and $[-w, w]$ for DA-CSR.

Figure 1: Sample matrix (left) in CSR storage (middle) and DA-CSR storage (right). Colorful dots represent non-zero entries, gray dots are zero. Whiskers represent (column) indices with respect to a reference line (dashed). CSR, Compressed Sparse Rows; DA, Diagonally-Addressed.

Sometimes it is necessary to reduce the bandwidth of a matrix before DA storage can be applied effectively, which we observe in the next example.

Example 2.2. The matrix GHS_psdef/ldoor from the SuiteSparse Matrix Collection [6] has dimension 952 203 and 46 522 475 pattern entries, distributed over a bandwidth of 686 979. On average, this matrix has 49 pattern entries per row. See Figure 2 (left) for its sparsity pattern. Due to its block structure, the original matrix bandwidth is fairly large. Still, a Reverse Cuthill-McKee (RCM) reordering [7] reduces the bandwidth to about 9100, which is only about 1% of the matrix dimension. Ignoring the colors, the corresponding sparsity pattern would look almost identical to Figure 2 (right). The reduced bandwidth is less than $2^{15} = 32\,768$ and therefore allows for the usage of 16 bit column indices in DA storage, while both the matrix dimension (for plain CSR storage) and the original bandwidth (for naive DA-CSR storage) would require 32 bit indices.

Figure 2: Sparsity patterns of the matrices GHS_psdef/ldoor (left) and Janna/Bump_2911 (right) from the SuiteSparse Matrix Collection [6].

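Whether a given (possibly reordered) matrix admits 16 bit offsets is decided by its bandwidth. A minimal sketch computing w of Equation (2) from plain CSR data (32 bit absolute indices; our naming) could look as follows.

    #include <cstdint>

    // Sketch: compute the matrix bandwidth w of Equation (2) from plain CSR
    // data with absolute column indices. The matrix fits DA storage with a
    // signed 16 bit iindex_t iff the result is less than 2^15 = 32768.
    int64_t bandwidth(const int32_t *rowptr, const int32_t *colids,
                      int32_t nrows) {
        int64_t w = 0;
        for (int32_t r = 0; r < nrows; ++r) {
            for (int32_t i = rowptr[r]; i < rowptr[r + 1]; ++i) {
                int64_t d = (int64_t)colids[i] - r; // signed distance to diagonal
                if (d < 0) d = -d;
                if (d > w) w = d;
            }
        }
        return w;
    }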
Following Listing 1, the matrix-related traffic amounts to
$$\begin{equation} (\mathrm{nrows}+1) \cdot \texttt{sizeof}(\texttt{oindex\_t}) + \mathrm{nnz} \cdot (\texttt{sizeof}(\texttt{iindex\_t}) + \texttt{sizeof}(\texttt{scalar\_t})). \end{equation}$$ (5)
If w may be stored in a smaller (integer) data type than ncols, this allows for a smaller iindex_t to be used. Using an index type half the size nearly halves the bookkeeping traffic.

Example 2.3. The matrix Janna/Bump_2911 from the SuiteSparse Matrix Collection [6] has dimension 2 911 419 and 127 729 899 non-zeros, distributed over a bandwidth of only $31\,343 < 2^{15} = 32\,768$, which is only about 1% of the matrix dimension. On average, this matrix has 44 non-zeros per row. See Figure 2 (right) for its sparsity pattern. Due to its dimension, standard CSR storage requires 32 bit indices for both oindex_t and iindex_t, which require 11 MiB and 487 MiB in total, respectively. DA-CSR allows for a 16 bit iindex_t to be used, which requires only 244 MiB, thus reducing the bookkeeping traffic by $1 - \frac{11+244}{11+487} \approx 48.8\%$, or from 4.09 to 2.09 Byte per non-zero, irrespective of scalar_t. For a 64 bit or 32 bit scalar_t, for example, IEEE 754 $\mathtt{double}$ or $\mathtt{float}$, which in total require 975 MiB and 487 MiB, respectively, using DA-CSR instead of CSR reduces the overall matrix-related traffic by $1 - \frac{11+244+975}{11+487+975} \approx 16.5\%$ and $1 - \frac{11+244+487}{11+487+487} \approx 24.7\%$, respectively.
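The sizes quoted in the example follow directly from formula (5); the following back-of-the-envelope sketch hard-codes the matrix parameters and reproduces the numbers (the example's 48.8% uses values rounded to whole MiB, so the exact figure deviates slightly).

    #include <cstdint>
    #include <cstdio>

    // Sketch: reproduce the traffic numbers of Example 2.3 via formula (5).
    int main() {
        const double MiB = 1024.0 * 1024.0;
        const int64_t nrows = 2911419, nnz = 127729899;
        const double rowptr32 = (nrows + 1) * 4.0; // 32 bit oindex_t, ~11 MiB
        const double colids32 = nnz * 4.0;         // CSR, ~487 MiB
        const double colids16 = nnz * 2.0;         // DA-CSR, ~244 MiB
        const double values64 = nnz * 8.0;         // IEEE 754 double, ~975 MiB
        printf("bookkeeping: %.1f -> %.1f MiB (-%.1f%%)\n",
               (rowptr32 + colids32) / MiB, (rowptr32 + colids16) / MiB,
               100.0 * (1.0 - (rowptr32 + colids16) / (rowptr32 + colids32)));
        printf("total (double): -%.1f%%\n",
               100.0 * (1.0 - (rowptr32 + colids16 + values64)
                            / (rowptr32 + colids32 + values64)));
        return 0;
    }
    // Prints roughly: bookkeeping: 498.4 -> 254.7 MiB (-48.9%); total: -16.5%.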

For matrices with more than a few non-zeros per row, it is therefore reasonable to ignore the effect of oindex_t, i.e. to assume $\texttt{sizeof}(\texttt{oindex\_t}) = 0$. The final percentages of the previous example would then be estimated by $\tfrac{1}{6}$ and $\tfrac{1}{4}$. Figure 3 shows this approximate reduction in matrix-related traffic by means of formula (5). Note how a smaller iindex_t, i.e. lower bookkeeping traffic, yields a better approximation of the factor $\tfrac{1}{2}$ observed for dense storage. Recall that by Equation (1) a traffic reduction is tightly coupled with expected performance gains for memory-bound operations.

Figure 3: Approximate matrix-related traffic reduction when exchanging the data types used to store the matrix scalars (horizontally) or column indices (vertically) of (DA-)CSR storage, as well as dense storage (no indices required). DA, Diagonally-Addressed; CSR, Compressed Sparse Rows.

While the goal of multi-precision algorithms is to exchange scalar_t for a smaller data type (as in, for example, [8]), i.e. traversing the rows of Figure 3, DA storage focuses on iindex_t, i.e. traversing the columns of Figure 3. However, as the main motivation of multi-precision algorithms is the memory bottleneck, DA storage is expected to enable even larger speedups in that context. CSR using 64 bit scalars and 32 bit indices merely allows for a $\tfrac{3}{2}\times$ performance speedup when switching to 32 bit scalars. Meanwhile, if the matrix has a representation in DA-CSR using 16 bit indices, the expected speedup is $\tfrac{5}{3}\times$. This speedup is much closer to the $2\times$ possible for dense storage when using a scalar type half the size.
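These ratios are the per-non-zero traffic quotients of Equation (1), counting $\texttt{sizeof}(\texttt{scalar\_t}) + \texttt{sizeof}(\texttt{iindex\_t})$ Byte per non-zero before and after halving the scalar type, and neglecting oindex_t as well as the traffic of the vectors:
$$\begin{equation*} \underbrace{\frac{8+4}{4+4}}_{\text{CSR}} = \frac{3}{2}, \qquad \underbrace{\frac{8+2}{4+2}}_{\text{DA-CSR}} = \frac{5}{3}, \qquad \underbrace{\frac{8}{4}}_{\text{dense}} = 2. \end{equation*}$$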

3 SELECTION OF MATRICES

The SuiteSparse Matrix Collection [6] contains 1367 square matrices having a CSR representation using 32 bit indices and a full structural rank. Only 993 of these matrices (72.6%) have a dimension less than $2^{15}$, i.e. fit into CSR using 16 bit column indices. However, for 1302 matrices (95.2%) there exists a permutation that reduces the matrix bandwidth to below $2^{15}$, such that these matrices fit into DA-CSR using 16 bit column indices. These are the matrices we select for further investigation.

Some of the investigated matrices are already stored in a bandwidth-reduced way. We applied an RCM reordering [7] to the ones that are not. Unfortunately, our implementation of the RCM permutation has not been able to sufficiently reduce the bandwidth of two matrices (Janna/Long_Coup_dt0 and Janna/Long_Coup_dt6), which reduces the number of matrices to 1300 (95.1%).

4 SPARSE MATRIX VECTOR PRODUCT

Let A denote a matrix, x and y be vectors, and α and β be scalars. The SpMV product denotes the operation $y \leftarrow \alpha A x + \beta y$, which requires
$$\begin{equation} W := 2\,\mathrm{nnz} + 2\,\mathrm{nrows} \end{equation}$$ (6)
floating-point operations of work W [FLOP]. The performance P [FLOP/s] is then defined as the ratio of work W and runtime t, where t is measured as elapsed (wall-clock) time. The relative performance w.r.t. some baseline is computed via
$$\begin{equation} P_{\text{candidate}} / P_{\text{baseline}} = t_{\text{baseline}} / t_{\text{candidate}}, \end{equation}$$ (7)
assuming $W_{\text{candidate}} = W_{\text{baseline}}$. The traffic [Byte] of computing the SpMV accounts for x and y on top of the three components of A in (DA-)CSR storage, compare Listing 1 and formula (5). The throughput [Byte/s] is given by the ratio of traffic and t, and the relative throughput is then computed via
$$\begin{equation} \frac{\mathrm{traffic}_{\text{candidate}}}{\mathrm{traffic}_{\text{baseline}}} \cdot \frac{t_{\text{baseline}}}{t_{\text{candidate}}}, \end{equation}$$ (8)
which is a scaled form of the relative performance. Refer to Figure 3 for typical and approximate expected traffic ratios. In the following, we aim to verify the predicted $\tfrac{6}{5}\times$ speedup when replacing 32 bit column indices by 16 bit ones.
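For 64 bit scalars, this prediction again follows from Equation (1) and the per-non-zero traffic, neglecting oindex_t and the traffic of the vectors x and y:
$$\begin{equation*} \frac{P_{\text{DA-CSR}}}{P_{\text{CSR}}} = \frac{\mathrm{traffic}_{\text{CSR}}}{\mathrm{traffic}_{\text{DA-CSR}}} = \frac{8+4}{8+2} = \frac{6}{5}. \end{equation*}$$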

4.1 Implementation details and methodology

A prototypical implementation of the SpMV product for a matrix in DA-CSR format is given in Listing 2. Instead of reversing the index translation in the innermost loop, i.e. computing $\texttt{oindex\_t}\ \texttt{col} = \texttt{row} + \texttt{colids}[i]$, we instead compute a shifted view $\texttt{xshift}$ into the factor $\texttt{x}$ one level up. This replaces nnz oindex_t-additions by nrows pointer-additions. Recall that in C/C++ the memory access $\texttt{x[row + col]}$ is equivalent to $\texttt{*(x + (row + col))}$. Applying associativity to the computation of the pointer address, we see that this access is also equivalent to $\texttt{*((x + row) + col)}$, i.e. to $\texttt{xshift[col]}$.

Listing 2: SpMV for DA-CSR.

    // Input:  oindex_t *rowptr, nrows;
    //         iindex_t *colids;
    //         scalar_t *values, *x, *y, alpha, beta;
    // Output: scalar_t *y;
    for (oindex_t row = 0; row < nrows; ++row) {
        scalar_t accumulator = 0;
        scalar_t *xshift = x + row;
        for (oindex_t i = rowptr[row]; i < rowptr[row+1]; ++i) {
            iindex_t col = colids[i];
            scalar_t val = values[i];
            accumulator += val * xshift[col];
        }
        y[row] = alpha * accumulator + beta * y[row];
    }

Remark 4.1. (Non-square matrices) For tall matrices, i.e. nrows > ncols, $\texttt{xshift}$ points to memory outside $\texttt{x}$ and must therefore never be dereferenced directly. Within Listing 2, however, it is only dereferenced at an offset $\texttt{col}$ that yields a memory address within $\texttt{x}$.

The code has been compiled with GCC 10.3.0 using -O3 -DNDEBUG -mavx2 -mfma. The benchmarks have been run on an Intel Xeon Silver 4110 (Skylake) running CentOS 7.9.2009, with threads pinned using taskset. Runtime measurements have been taken using nanobench [9] with $\mathtt{minEpochTime}$ set to 100 ms, $\mathtt{minEpochIterations}$ and $\mathtt{warmup}$ both set to 10, using the minimum over 11 epochs.
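For illustration, the measurement loop might be set up roughly as follows using the public nanobench [9] API; the kernel call is a placeholder and the data setup is elided.

    #include <chrono>
    #include <vector>
    #include <nanobench.h>

    int main() {
        std::vector<double> y(1000, 0.0); // placeholder output vector
        ankerl::nanobench::Bench()
            .minEpochTime(std::chrono::milliseconds(100))
            .minEpochIterations(10)
            .warmup(10)
            .epochs(11)
            .run("SpMV DA-CSR", [&] {
                // placeholder for the kernel under test, e.g. Listing 2
                // wrapped in a function:
                // spmv_da_csr(rowptr, colids, values, x, y.data(),
                //             alpha, beta, nrows);
                ankerl::nanobench::doNotOptimizeAway(y.data());
            });
        // The reported runtime is then taken as the minimum over the 11 epochs.
    }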

We measured the performance of a naive implementation (with nnz additions instead of nrows) as well as several implementations akin to Listing 2, optionally using OpenMP with two, four, six, or eight threads, both for CSR and DA-CSR matrices. Among all implementations executed on the given hardware, the best performing ones were the naive implementation, the one using three accumulators, and the one using a single AVX2 accumulator (four scalars wide) accessing $\mathtt{values}$ using unaligned load instructions. In the following, for each matrix and each storage format tested, we select the implementation and number of threads yielding the highest performance.
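As an illustration of the latter variant, the following sketch uses a single four-wide AVX2 accumulator with FMA, unaligned loads from values, and a gather from the shifted view of x; the signed 16 bit offsets are sign-extended to 32 bit for the gather. This is a sketch under the stated assumptions, not the exact benchmarked kernel, and the remainder handling is kept simple.

    #include <immintrin.h>
    #include <cstdint>

    // Sketch: DA-CSR SpMV with one 4-wide AVX2 accumulator per row.
    // Compile with -mavx2 -mfma (as in the flags above).
    void spmv_da_csr_avx2(const int32_t *rowptr, const int16_t *colids,
                          const double *values, const double *x, double *y,
                          double alpha, double beta, int32_t nrows) {
        for (int32_t row = 0; row < nrows; ++row) {
            const double *xshift = x + row;
            __m256d acc = _mm256_setzero_pd();
            int32_t i = rowptr[row];
            const int32_t end = rowptr[row + 1];
            for (; i + 4 <= end; i += 4) {
                // load 4 x int16 offsets and sign-extend them to int32
                __m128i off16 = _mm_loadl_epi64((const __m128i *)(colids + i));
                __m128i off32 = _mm_cvtepi16_epi32(off16);
                __m256d xv = _mm256_i32gather_pd(xshift, off32, 8);
                __m256d av = _mm256_loadu_pd(values + i); // unaligned load
                acc = _mm256_fmadd_pd(av, xv, acc);
            }
            // horizontal sum of the 4 partial products
            __m128d lo = _mm256_castpd256_pd128(acc);
            __m128d hi = _mm256_extractf128_pd(acc, 1);
            __m128d sum2 = _mm_add_pd(lo, hi);
            double accumulator =
                _mm_cvtsd_f64(_mm_add_sd(sum2, _mm_unpackhi_pd(sum2, sum2)));
            for (; i < end; ++i) // scalar tail
                accumulator += values[i] * xshift[colids[i]];
            y[row] = alpha * accumulator + beta * y[row];
        }
    }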

For the MKL [4] implementation of the CSR format, we measured single-threaded as well as multithreaded performance. Again, for each matrix we select the number of threads yielding the highest performance.

4.2 Numerical results

For traffic within the size of the L1 cache, a single thread yields the best performance. Up to about 100 KiB, which is well within the size of a single L2 cache, the optimum number of threads increases gradually. For traffic larger than that, the maximum number of threads yields the best performance. This behavior is irrespective of the matrix format and the implementation vendor (ourselves or MKL [4]).

Our implementation of the SpMV product for the CSR format using 32 bit indices performs about the same as the MKL [4], see Table 1. Figure 4 shows the comparison of DA-CSR using 16 bit column indices to CSR. Using DA-CSR shows almost no change for traffic within the combined size of the L2 caches, i.e. up to $8 \cdot 1\,\text{MiB}$. For traffic larger than that, up to the combined size of all caches, i.e. up to about $8 + 11\,\text{MiB}$, we observe a speedup larger than $1.4\times$. For traffic beyond that, we observe an average speedup of about +17.5%, which is reasonably close to the expected +20%. However, the throughput drops slightly, indicating some unused potential on the given hardware. See Table 2 for the comparison of DA-CSR to MKL [4].

TABLE 1. Average relative performance of our best SpMV implementation for CSR using 32 bit indices w.r.t. MKL [4] as the baseline. Values >1 mean we are faster.
Traffic Single-threaded Multithreaded
L1d 1.131 1.129
L2 1.073 1.044
L3 1.012 1.011
Large 0.982 0.994
  • Abbreviations: CSR, Compressed Sparse Rows; MKL, Math Kernel Library; SpMV, Sparse Matrix Vector.
TABLE 2. Average relative performance of our best SpMV implementation for DA-CSR using 16 bit column indices w.r.t. MKL [4] as the baseline. Values >1 mean we are faster.
Traffic Single-threaded Multithreaded
L1d 1.103 1.098
L2 1.073 1.055
L3 1.052 1.032
Large 1.110 1.172
  • Abbreviations: CSR, Compressed Sparse Rows; DA, Diagonally-Addressed; MKL, Math Kernel Library; SpMV, Sparse Matrix Vector.
Figure 4: Relative performance and throughput of SpMV using the DA-CSR format with 16 bit column indices w.r.t. CSR using 32 bit column indices as the baseline (iso-scalar). The sizes of the L1d, L2, and L3 CPU caches are marked with vertical lines (left to right). CSR, Compressed Sparse Rows; DA, Diagonally-Addressed; SpMV, Sparse Matrix Vector.

Remark 4.2. (CSR using 16 bit column indices) Recall that 993 matrices have a direct representation in CSR using smaller column indices. The DA-CSR format performs on par with CSR using the same index types for these matrices.

5 CONCLUSION AND OUTLOOK

DA storage allows one to nearly halve the bookkeeping traffic in sparse matrix storage formats whenever the matrix bandwidth allows for an index type half the size. On the hardware used, DA-CSR storage with 16 bit column indices improves the multithreaded SpMV performance over CSR storage with 32 bit column indices by more than 17%, for both our implementation and MKL [4], if the traffic exceeds the size of the L3 cache of the CPU. Meanwhile, DA-CSR performs no worse than CSR when using the same data types.

ACKNOWLEDGMENTS

Open access funding enabled and organized by Projekt DEAL.

DATA AVAILABILITY STATEMENT

The source code is available at:

https://doi.org/10.5281/zenodo.8104335

The visualizations in this paper have been created using TikZ [10] and Makie.jl [11]. The SpMV performance measurements for the reported experiments are available at:

https://doi.org/10.5281/zenodo.7551699

• 1 The final entry $\texttt{rowptr}[\mathrm{nrows}]$ is set to nnz for ease of use.
• 2 Technically, nnz refers to the number of pattern entries, which contains non-zeros as well as explicitly stored zeros.
• 3 Our implementation yields a bandwidth of 9120, while the SuiteSparse Matrix Collection [6] reports 9134.
• 4 Note that dense storage may be seen as having $\texttt{sizeof}(\texttt{oindex\_t}) = \texttt{sizeof}(\texttt{iindex\_t}) = 0$, i.e. having zero bits of bookkeeping (per non-zero).
• 5 Eventually, we are interested in using DA storage when solving linear systems. We thus take full structural rank as a proxy for regularity, as the collection's metadata does not contain the numerical rank for all the matrices. Consequently, our selection of matrices contains irregular matrices as well.
• 6 https://www.mpi-magdeburg.mpg.de/cluster/mechthild
• 7 https://www.man7.org/linux/man-pages/man1/taskset.1.html and https://github.com/util-linux/util-linux/blob/master/schedutils/taskset.c
