# Efficiency Demonstration Notes of efficiency when programming (general advice, not only for `blas-array2`): - Row/col-major layout. One of two dimensions in matrix should be contiguous (stride of the dimension is one). > This note may seems to be common sense, but will be useful when applying matrix multiplication to sub-tensors, or problem of tensor products that can be reduced to matrix multiplication. For example, $A_{ijkl} \rightarrow_{ik} A_{jl}$ is row-major, $A_{ijkl} \rightarrow_{jl} A_{ik}$ is col-major, but $A_{ijkl} \rightarrow_{il} A_{jk}$ is nither row/col-major. So it is important to avoid such cases when deriving formulas involving tensor products. - Avoid memory allocation and outplace add-assign. > This will not affect efficiency a lot, but could be avoided if calling API correctly. To be more specific, it is better to call > ``` > gemm(A, B, out = C, beta = 1) > ``` > instead to call the following form, though more intutive but involves unnecessary allocation and add-assign of `C` > ``` > C += gemm(A, B) > ``` We benchmark efficiency of `blas-array2`, `ndarray` and `faer` on [AMD 7945HX](https://www.amd.com/en/products/processors/laptop/ryzen/7000-series/amd-ryzen-9-7945hx.html). Estimated theoretical performance is 1126.4 GFLOPS. Related code is in [`demo-efficiency`](https://github.com/ajz34/blas-array2/tree/main/demo-efficiency) directory in repository. The problem to be benchmarked is $$ C_{ij} = \sum_{mk} A_{mik} B_{mkj} \quad (\texttt{DGEMM}, 10 \times 2560 \times 2560) $$ and $$ C_{ij} = \sum_{mk} A_{mik} A_{mkj} \quad (\texttt{DSYRK}, 10 \times 2560 \times 2560) $$ Main purpose of this benchmark, is to show that cost of `blas-array2` itself (manipulation and generation of `ndarray` views, flags, stride length) is considerably small. Efficiency is mostly affected by BLAS backend, instead of representation of data. ## DGEMM | crate | API call | backend | GFLOPS | |--|--|--|--| | `blas-array2` | inplace | AOCL | 964.1 | | | | OpenBLAS | 947.6 | | | | MKL | 768.0 | | | outplace add-assign | AOCL | 790.3 | | | | OpenBLAS | 641.3 | | | | MKL | 630.3 | | `ndarray` | outplace add-assign | AOCL | 780.8 | | | | OpenBLAS | 636.5 | | | | MKL | 626.6 | | `faer` | inplace | faer | 844.7 | ## DSYRK | crate | API call | backend | GFLOPS | |--|--|--|--| | `blas-array2` | inplace | AOCL | 735.6 | | | | OpenBLAS | 669.2 | | | | MKL | 594.1 | | | outplace add-assign | AOCL | 541.5 | | | | OpenBLAS | 456.0 | | | | MKL | 487.9 | | `ndarray` | outplace add-assign | AOCL (GEMM) | 408.3 | | | | OpenBLAS (GEMM) | 324.9 | | | | MKL (GEMM) | 340.4 | | `faer` | inplace | faer (GEMMT) | 670.0 | ## Version info - AOCL 4.2 - OpenBLAS 0.3.27 - MKL 2024.0