## Trivial Details...

### Matrix Multiplication

The initial implementation of matrix multiplication was a naive triple loop, without even considering cache-friendly array indexing. The code looked like

```rust
for j in 0..n { // every column of B
    for i in 0..m { // every row of A
        for p in 0..k {
            C[i][j] += A[i][p] * B[p][j]; // walking down a column of B misses the cache constantly
        }
    }
}
```

I modified the indexing order a little when I set out to optimize its outrageously slow speed...

```rust
for i in 0..m {
    for p in 0..k {
        for j in 0..n {
            C[i][j] += A[i][p] * B[p][j]; // B is now read row by row, so accesses stay in cache
        }
    }
}
```

The reordering alone makes it about **2x faster** (test size is 512^3). Then I found that the Rust compiler can usually auto-vectorize iterator-based loops with SIMD, so I converted the index-based loops to iterators and got about **40x faster** overall (the iterator form is sketched at the end of this section).

### NdArray Permute

The permute operation reorders the actual data, so every element of the ndarray has to be visited at least once, which puts an $O(N)$ lower bound on the time complexity. Initially I implemented it with recursive DFS calls. It was quite shocking to find that permutation consumed even more resources than the multiplication operation (though that is another story, and permute is not to blame for it). A simple optimization is to replace the recursion with a loop (sketched below), and it gets about **20x faster** for high-dimensional ndarrays (like 512^3 elements spread over more than 4 dims).
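Concretely, here is a minimal sketch of the loop-based permute. It assumes a flat row-major buffer plus a shape vector, and replaces the recursive DFS with an odometer-style counter over the output indices; the names (`permute`, `axes`, and so on) are illustrative, not the project's actual API.

```rust
/// Sketch: permute `data` (row-major, shape `shape`) by the axis order `axes`.
/// Assumes `data.len()` equals the product of `shape` and `ndim >= 1`.
fn permute(data: &[f32], shape: &[usize], axes: &[usize]) -> (Vec<f32>, Vec<usize>) {
    assert_eq!(shape.len(), axes.len(), "axes must be a permutation of the dims");
    let ndim = shape.len();

    // Row-major strides of the source array.
    let mut strides = vec![1usize; ndim];
    for d in (0..ndim).rev().skip(1) {
        strides[d] = strides[d + 1] * shape[d + 1];
    }

    // Output shape, and source strides viewed through the permutation.
    let out_shape: Vec<usize> = axes.iter().map(|&a| shape[a]).collect();
    let src_strides: Vec<usize> = axes.iter().map(|&a| strides[a]).collect();

    let mut out = vec![0.0f32; data.len()];
    let mut idx = vec![0usize; ndim]; // odometer over the output indices
    let mut src = 0usize;             // running flat index into `data`

    for slot in out.iter_mut() {
        *slot = data[src];
        // Advance the odometer; this replaces the recursive DFS.
        for d in (0..ndim).rev() {
            idx[d] += 1;
            src += src_strides[d];
            if idx[d] < out_shape[d] {
                break;
            }
            idx[d] = 0; // carry: reset this digit and undo its contribution
            src -= src_strides[d] * out_shape[d];
        }
    }
    (out, out_shape)
}
```

Besides removing call overhead, the loop keeps a single running source index instead of recomputing offsets per recursive call, which is where most of the win comes from.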
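And, as promised in the matrix-multiplication part above, a sketch of what the iterator-based kernel can look like. Flat row-major slices and this exact signature are my assumptions; the point is only that slice iterators let the compiler elide bounds checks and auto-vectorize the inner loop with SIMD.

```rust
/// Sketch: C (m x n) += A (m x k) * B (k x n), all flat row-major slices.
fn matmul(a: &[f32], b: &[f32], c: &mut [f32], m: usize, k: usize, n: usize) {
    for i in 0..m {
        let c_row = &mut c[i * n..(i + 1) * n];
        for p in 0..k {
            let a_ip = a[i * k + p];
            let b_row = &b[p * n..(p + 1) * n];
            // Zipped slice iterators prove the bounds to the compiler,
            // so this inner loop auto-vectorizes.
            for (c_elem, &b_elem) in c_row.iter_mut().zip(b_row) {
                *c_elem += a_ip * b_elem;
            }
        }
    }
}
```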