# Interpreting Benchmarks (version 0.1.5)

I'm using GitHub-style markdown here to present tables.

Headline figures for AMD A10:

| Test                | Time      | Note                             |
| :------------------ | --------: | :------------------------------- |
| gf4 mul/ref         | 526.49 ns | Reference version of GF(16) mul  |
| gf4 mul/good        | 544.66 ns | Slightly worse than reference    |
| gf4 mul/mull        | 275.46 ns | Long multiply                    |
| gf4 mul/mull-reduce | 3.5463 us | Long mul + modular reduction     |
| gf4 inv/ref         | 153.82 ns | Reference version of inv         |
| gf4 inv/good        | 38.776 ns | Much better than reference       |
| gf8 mul/ref         | 112.05 us | Reference version of GF(256) mul |
| gf8 mul/good        | 137.36 us | Slightly worse than reference    |

On AArch64 (ODROID N2):

| Test                | Time      | Note                             |
| :------------------ | --------: | :------------------------------- |
| gf4 mul/ref         | 1.2899 us | Reference version of GF(16) mul  |
| gf4 mul/good        | 1.2116 us | Slightly better than reference   |
| gf4 mul/mull        | 711.73 ns |                                  |
| gf4 mul/mull-reduce | 6.2504 us |                                  |
| gf4 inv/ref         | 245.31 ns | Reference version of inv         |
| gf4 inv/good        | 64.380 ns | Much better than reference       |
| gf8 mul/ref         | 184.77 us | Reference version of GF(256) mul |
| gf8 mul/good        | 294.09 us | Much worse than reference        |

## Assembly output (ARM)

Using `cargo asm` to examine the GF(2^8) multiply code, I see nothing unusual:

```Assembly
::mul:
        ldr     x8, [x0]               // addr of log table
        and     x9, x1, #0xff          // operand a
        and     x10, x2, #0xff         // operand b
        ldrsh   x9, [x8, x9, lsl #1]   // log[a]
        ldrsh   x8, [x8, x10, lsl #1]  // log[b]
        ldr     x10, [x0, #48]         // addr of exp_entry
        add     x8, x8, x9             // log a + log b
        ldrb    w0, [x10, x8]          // exp[sum]
        ret
```

The important thing to note here is that there are five memory accesses: two to fetch the addresses of the `exp` and `log` tables, and three to look up values in those tables. There's also a bit of overhead from applying a `0xff` mask to each operand. I'm not sure why that's needed, since the inputs are of type `u8`. The assembly output on `x86_64` is similar.

On both platforms, it seems that the cost of memory accesses is (broadly speaking) higher than the cost of calculating the product using bit manipulation and loop counting. The extra masking in the ARM code is probably also contributing somewhat to the worse performance.

## Possible improvement: static tables

We can remove two memory lookups by storing static tables within the library. I will implement this for the fields in `guff::good`, but I want to avoid cluttering up (bloating) the library with lookup tables that probably won't be used. I can use Rust's `feature` mechanism to make static tables optional. I'll need to do a bit of refactoring to make that work. Another, related bit of refactoring will be moving the particular implementations of fields out of `guff::good` and into a new module, `guff::tables`.

As an interim measure, so as to avoid needing to refactor anything too soon, I will start adding static tables to the `guff::good` module. I won't include the code to generate those tables at first, though. In the longer term, I will:

* write Rust code for generating the tables (ported from C code I have already written)
* move the static tables into the new module and allow them to be selected as features

The `guff::good` module will continue to act as a kind of switchboard, plugging into whatever static tables are available, or falling back to dynamically-generated tables if not.
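To make the idea concrete, here's a minimal sketch of what a feature-gated static table might look like, with the tables built at compile time by a `const fn`. The feature name (`static_tables`), the module and function names, and the choice of field polynomial (0x11d, for which 2 generates the whole multiplicative group) are all assumptions for the sake of illustration; they are not guff's actual API or field definitions.

```rust
// Sketch only: feature name, module layout and polynomial are assumptions,
// not guff's actual API or field definitions.

#[cfg(feature = "static_tables")]
mod static_gf8 {
    /// EXP[i] = 2^i in GF(2^8) mod 0x11d. The table is doubled (510 live
    /// entries) so that EXP[log a + log b] never needs a "mod 255" step.
    const fn build_exp() -> [u8; 512] {
        let mut exp = [0u8; 512];
        let mut x: u16 = 1;
        let mut i = 0;
        while i < 255 {
            exp[i] = x as u8;
            exp[i + 255] = x as u8;
            x <<= 1;            // multiply by 2 ...
            if (x & 0x100) != 0 {
                x ^= 0x11d;     // ... reducing mod the polynomial
            }
            i += 1;
        }
        exp
    }

    /// LOG is the inverse of EXP over the nonzero field elements.
    /// (guff's own log tables appear to hold 16-bit entries, judging by
    /// the `ldrsh` in the assembly above; this sketch checks for zero
    /// operands explicitly instead.)
    const fn build_log() -> [u8; 256] {
        let exp = build_exp();
        let mut log = [0u8; 256];
        let mut i = 0;
        while i < 255 {
            log[exp[i] as usize] = i as u8;
            i += 1;
        }
        log
    }

    static EXP: [u8; 512] = build_exp();
    static LOG: [u8; 256] = build_log();

    /// GF(2^8) multiply using the static tables. EXP and LOG are
    /// link-time constants, so the two pointer loads seen in the
    /// assembly above disappear, leaving three table lookups.
    pub fn mul(a: u8, b: u8) -> u8 {
        if a == 0 || b == 0 {
            return 0;
        }
        EXP[LOG[a as usize] as usize + LOG[b as usize] as usize]
    }
}
```

With something like this in place, `guff::good` could re-export the static-table implementation when the feature is enabled, and fall back to the dynamically-generated tables otherwise.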
## It's not all bad

I was a little bit surprised that the "accelerated" multiply routines were, in most cases, worse than the reference implementation. However, I'm not too concerned about it right now.

For one thing, we probably shouldn't be too focused on how fast we can do single multiply operations. It's fine for benchmarking, but in reality most users will want to work with vectors and matrices. In that case, looking up table addresses (from some memory location) only needs to be done once, at the start of each vector or matrix operation. When this is amortised over the length of the vector or the size of the matrix, it mostly becomes negligible. (There's a rough sketch of this in the appendix at the end of the page.)

The second reason for not being too worried is that for smaller field sizes, it's easy for any potential gains from table lookups to get wiped out by the practicalities of memory accesses. When I move on to implementing table lookups for GF(2^16) and GF(2^32), I expect that the "good" implementations will start to pull ahead.

Thirdly, and finally, my ultimate destination for this library is to include SIMD code for working on vectors and matrices. I've already implemented some of this in assembly, and the results are quite impressive. The ARM NEON code running on a moderately powerful armv7 board (ODROID XU4) already outperforms my best software-based (single) multiply routine running on my AMD A10 desktop. Obviously, I shouldn't read too much into that since, as I've said above, benchmarking single multiply throughput isn't really that meaningful.

## Updates

I will probably add more documentation like this from time to time. I won't necessarily do it every release, though. Where I do include benchmarks, I will use the official release code so that others can reproduce them.
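## Appendix: amortising table lookups over a vector

To make the amortisation argument above a bit more concrete, here's a rough sketch of the kind of vector operation I have in mind. The struct layout and method name are hypothetical, not guff's current API; the point is only that the table references (and `log[b]`) are fetched once per call rather than once per product.

```rust
// Hypothetical sketch: the struct and method below are illustrative only,
// not guff's actual types or API.
struct Gf256Field {
    log: Vec<i16>, // 256 entries; log[0] is a sentinel, unused here
    exp: Vec<u8>,  // doubled exp table, at least 509 usable entries
}

impl Gf256Field {
    /// Multiply every element of `dst` by the scalar `b`, in place.
    /// The table slices and log[b] are fetched once, before the loop,
    /// so the per-call setup cost seen in the single-multiply assembly
    /// is paid once per vector rather than once per element.
    fn mul_slice_by_scalar(&self, dst: &mut [u8], b: u8) {
        if b == 0 {
            dst.iter_mut().for_each(|d| *d = 0);
            return;
        }
        let (log, exp) = (&self.log[..], &self.exp[..]);
        let log_b = log[b as usize] as usize;
        for d in dst.iter_mut() {
            *d = if *d == 0 {
                0
            } else {
                exp[log[*d as usize] as usize + log_b]
            };
        }
    }
}
```

A SIMD version would work on the same principle, just producing several products per loop iteration instead of one.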