## Trivial Details...

### Matrix Multiplication

The initial implementation of matrix multiplication was a naive triple loop, without even considering cache-friendly array indexing. The code looked like

```rust
for j in 0..n { // every column of B
    for i in 0..m { // every row of A
        for p in 0..k {
            C[i][j] += A[i][p] * B[p][j]; // walking down a column of B misses the cache constantly
        }
    }
}
```

I modified the indexing order a little when I set out to optimize its outrageously slow speed...

```rust
for i in 0..m {
    for p in 0..k {
        for j in 0..n {
            C[i][j] += A[i][p] * B[p][j]; // B is now read row by row, so accesses stay in cache
        }
    }
}
```

The reordering alone makes it about **2x faster** (test size is 512^3). Then I found that the Rust compiler can usually auto-vectorize iterator-based loops with SIMD, so I converted the index-based loops to iterators and got about **40x faster** overall (the iterator form is sketched at the end of this section).

### NdArray Permute

The permute operation reorders the actual data, so every element of the ndarray has to be visited at least once, which puts an $O(N)$ lower bound on the time complexity. Initially I implemented it with recursive DFS calls. It was quite shocking to find that permutation consumed even more resources than the multiplication operation (though that is another story, and permute is not to blame for it). A simple optimization is to replace the recursion with a loop (sketched below), and it gets about **20x faster** for high-dimensional ndarrays (like 512^3 elements spread over more than 4 dims).
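Concretely, here is a minimal sketch of the loop-based permute. It assumes a flat row-major buffer plus a shape vector, and replaces the recursive DFS with an odometer-style counter over the output indices; the names (`permute`, `axes`, and so on) are illustrative, not the project's actual API.

```rust
/// Sketch: permute `data` (row-major, shape `shape`) by the axis order `axes`.
/// Assumes `data.len()` equals the product of `shape` and `ndim >= 1`.
fn permute(data: &[f32], shape: &[usize], axes: &[usize]) -> (Vec<f32>, Vec<usize>) {
    assert_eq!(shape.len(), axes.len(), "axes must be a permutation of the dims");
    let ndim = shape.len();

    // Row-major strides of the source array.
    let mut strides = vec![1usize; ndim];
    for d in (0..ndim).rev().skip(1) {
        strides[d] = strides[d + 1] * shape[d + 1];
    }

    // Output shape, and source strides viewed through the permutation.
    let out_shape: Vec<usize> = axes.iter().map(|&a| shape[a]).collect();
    let src_strides: Vec<usize> = axes.iter().map(|&a| strides[a]).collect();

    let mut out = vec![0.0f32; data.len()];
    let mut idx = vec![0usize; ndim]; // odometer over the output indices
    let mut src = 0usize;             // running flat index into `data`

    for slot in out.iter_mut() {
        *slot = data[src];
        // Advance the odometer; this replaces the recursive DFS.
        for d in (0..ndim).rev() {
            idx[d] += 1;
            src += src_strides[d];
            if idx[d] < out_shape[d] {
                break;
            }
            idx[d] = 0; // carry: reset this digit and undo its contribution
            src -= src_strides[d] * out_shape[d];
        }
    }
    (out, out_shape)
}
```

Besides removing call overhead, the loop keeps a single running source index instead of recomputing offsets per recursive call, which is where most of the win comes from.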
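And, as promised in the matrix-multiplication part above, a sketch of what the iterator-based kernel can look like. Flat row-major slices and this exact signature are my assumptions; the point is only that slice iterators let the compiler elide bounds checks and auto-vectorize the inner loop with SIMD.

```rust
/// Sketch: C (m x n) += A (m x k) * B (k x n), all flat row-major slices.
fn matmul(a: &[f32], b: &[f32], c: &mut [f32], m: usize, k: usize, n: usize) {
    for i in 0..m {
        let c_row = &mut c[i * n..(i + 1) * n];
        for p in 0..k {
            let a_ip = a[i * k + p];
            let b_row = &b[p * n..(p + 1) * n];
            // Zipped slice iterators prove the bounds to the compiler,
            // so this inner loop auto-vectorizes.
            for (c_elem, &b_elem) in c_row.iter_mut().zip(b_row) {
                *c_elem += a_ip * b_elem;
            }
        }
    }
}
```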