# Interpreting Benchmarks (version 0.1.7)
I'm using GitHub-style markdown here to present tables.
New benchmarks added for the GF(2^16) field.
On AMD A10:
| Test | Time | Note |
| :--- | ---: | :--- |
| gf16 mul/ref | 144.84 us | Reference GF(2^16) mul |
| gf16 mul/good | 478.20 us | Much worse than reference |
| gf16 inv/ref | 14.833 us | Reference inv |
| gf16 inv/good | 558.80 ns | Much better than reference |
| gf16 div/ref | 3.8748 ms | Reference GF(2^16) div |
| gf16 div/good | 599.82 us | Much better than reference |
On AArch64 (ODROID N2):
| Test | Time | Note |
| :--- | ---: | :--- |
| gf16 mul/ref | 221.20 us | Reference GF(2^16) mul |
| gf16 mul/good | 1.1230 ms | Much worse than reference |
| gf16 inv/ref | 22.151 us | Reference inv |
| gf16 inv/good | 1.5252 us | Much better than reference |
| gf16 div/ref | 5.9635 ms | Reference GF(2^16) div |
| gf16 div/good | 1.4071 ms | Much better than reference |
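As a minimal sketch of how a Criterion benchmark along these lines could be set up (in a `benches/` file with `harness = false` in `Cargo.toml`): the multiply routine, the polynomial 0x1100b, the batch size and the benchmark id below are all placeholders, not guff's actual code or configuration.

```rust
use criterion::{black_box, criterion_group, criterion_main, Criterion};

// Bit-serial GF(2^16) multiply: shift-and-add with reduction folded in.
// The polynomial 0x1100b (x^16 + x^12 + x^3 + x + 1) is only an assumed
// choice for this sketch; it is not necessarily the one guff uses.
fn gf16_long_mul(a: u16, mut b: u16) -> u16 {
    const POLY: u32 = 0x1_100b;
    let mut acc: u32 = 0;
    let mut a = a as u32;
    for _ in 0..16 {
        if b & 1 != 0 {
            acc ^= a;
        }
        b >>= 1;
        a <<= 1;
        if a & 0x1_0000 != 0 {
            a ^= POLY; // clear bit 16, keeping a within 16 bits
        }
    }
    acc as u16
}

fn bench_gf16_mul(c: &mut Criterion) {
    // One Criterion "iteration" multiplies a whole batch of operands,
    // which is why the headline times come out in µs or ms.
    let operands: Vec<u16> = (1..=4096).collect();
    c.bench_function("gf16 mul/ref (sketch)", |b| {
        b.iter(|| {
            for &x in &operands {
                black_box(gf16_long_mul(black_box(x), 0x53));
            }
        })
    });
}

criterion_group!(benches, bench_gf16_mul);
criterion_main!(benches);
```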
## Assembly output (ARM)
Nothing unusual. I saw two bounds checks on the mod-reduce lookups
(which I eliminated) but none on the MULL lookups, presumably because
the compiler could prove that those lookups can never produce an
out-of-range index. Nice.
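For the record, the usual way to let rustc drop such a check is to size the table as a power of two and mask the index so the compiler can see it is always in range. A minimal sketch of the idiom, with a made-up table name (not guff's actual table):

```rust
// Placeholder 256-entry mod-reduce table; contents are irrelevant here.
static MOD_REDUCE: [u16; 256] = [0u16; 256];

#[inline]
fn reduce_step(hi_byte: u16) -> u16 {
    // `& 0xff` guarantees the index is 0..=255, so the compiler can prove
    // the access is in bounds and omit the bounds check.
    MOD_REDUCE[(hi_byte & 0xff) as usize]
}
```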
## Analysis
With my C version of the same code, I do actually get faster results
with table lookups:

    Test 0: long (modular) multiply (u16 version)
    Test 0: -> 59.36284 M*/s
    Test 16: 8-bit x 4-bit straight *, l-r on nibbles, fast shift w/o init
    Test 16: -> 130.75171 M*/s

There are some differences, though:
* I use separate l-r tables
* I look up the l-r tables directly using indices constructed like `a3 | b1`
By using separate l-r tables, I would eliminate one shift on output
per rmull lookup, saving 4 instructions.
By using operands that are pre-shifted, with nibbles in the high four
bits, I would save a further 8 shifts.
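To make those two ideas concrete, here's a rough sketch of separate left/right nibble-product tables looked up with a pre-shifted operand. The table names, layout and build step are illustrative only; this is neither guff's current code nor an exact copy of my C scheme.

```rust
// Straight (carryless) product of two 4-bit values fits in 7 bits; split it
// into a high ("left") and low ("right") nibble, one table each, indexed by
// `(a_nibble << 4) | b_nibble`.
fn build_lr_tables() -> ([u8; 256], [u8; 256]) {
    let mut l = [0u8; 256];
    let mut r = [0u8; 256];
    for a in 0u16..16 {
        for b in 0u16..16 {
            // carryless 4x4 multiply
            let mut p = 0u16;
            for bit in 0..4u32 {
                if b & (1 << bit) != 0 {
                    p ^= a << bit;
                }
            }
            let idx = ((a << 4) | b) as usize;
            l[idx] = (p >> 4) as u8;   // "left" half, already shifted down
            r[idx] = (p & 0x0f) as u8; // "right" half
        }
    }
    (l, r)
}

// Carryless product of an 8-bit value and a 4-bit value using the tables.
// Because `a`'s high nibble already sits in bits 4..7, the first index is
// just an OR, with no shift at lookup time.
fn straight_mul_8x4(a: u8, b: u8, l: &[u8; 256], r: &[u8; 256]) -> u16 {
    let i_hi = ((a & 0xf0) | (b & 0x0f)) as usize;        // pre-shifted
    let i_lo = (((a & 0x0f) << 4) | (b & 0x0f)) as usize; // one shift
    ((l[i_hi] as u16) << 8)
        ^ (((r[i_hi] ^ l[i_lo]) as u16) << 4)
        ^ (r[i_lo] as u16)
}
```

A full GF(2^16) multiply would combine eight such partial products at the appropriate shifts and then reduce, which is where the per-lookup shift savings would add up.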
Whether either of these will improve the situation enough to bring the
"good" mul implementation within striking distance of the reference
one isn't clear to me, though. I suspect it won't be enough.
Actually, looking at the generated assembly, it appears that the
compiler has already optimised the array indexing for me.
### What's causing the discrepancy between Rust and C?
Since I can't see anything particularly inefficient in the table-lookup
code, the other possibility is that the Rust compiler has optimised the
*reference* code better than my C compiler has.
Right now, I don't have a way of looking at the assembly output
since the generic code isn't concretely instantiated in the library.
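The obvious workaround is to force a concrete instantiation myself: a tiny non-generic `#[no_mangle]` wrapper gives objdump (or the third-party `cargo asm` subcommand) a stable symbol to find. A minimal sketch, using a stand-in generic function rather than guff's real trait and method names:

```rust
// Stand-in for a generic field operation; the body here is just GF(2)
// addition, not the real multiply.
fn generic_op<T: Copy + std::ops::BitXor<Output = T>>(a: T, b: T) -> T {
    a ^ b
}

// Calling the generic with a concrete type forces rustc to emit a
// monomorphised copy behind an unmangled, easy-to-find symbol.
#[no_mangle]
pub extern "C" fn gf16_op_concrete(a: u16, b: u16) -> u16 {
    generic_op(a, b)
}
```

Built in release mode, the wrapper's disassembly should show exactly what the generic code compiles to for `u16`.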
### Can other optimisations do any better than the reference code?
It should be possible. I chose this particular implementation of
GF(2^16) to begin with because it has the smallest memory
footprint and because a similar technique can be applied to
GF(2^32). Also, while a more obvious approach might have
been to use, say, full multiplication lookup tables or log/exp tables,
I'd already used those for GF(16) and GF(256), so I wanted to try
something different.
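For comparison, the log/exp approach looks roughly like the sketch below for GF(2^8). The AES-style polynomial 0x11b and generator 0x03 are assumptions for this sketch, not necessarily what guff's GF(256) tables use.

```rust
// Multiply by x modulo x^8 + x^4 + x^3 + x + 1 (0x11b).
fn xtime(v: u16) -> u16 {
    let v = v << 1;
    if v & 0x100 != 0 { v ^ 0x11b } else { v }
}

// Build log and (doubled) exp tables for generator 0x03.
fn build_log_exp() -> ([u8; 256], [u8; 512]) {
    let mut log = [0u8; 256];
    let mut exp = [0u8; 512];
    let mut x: u16 = 1;
    for i in 0..255u16 {
        exp[i as usize] = x as u8;
        exp[(i + 255) as usize] = x as u8; // doubled table: no mod 255 later
        log[x as usize] = i as u8;
        x ^= xtime(x); // multiply x by the generator 0x03 (= x + 1)
    }
    (log, exp)
}

fn gf256_mul(a: u8, b: u8, log: &[u8; 256], exp: &[u8; 512]) -> u8 {
    if a == 0 || b == 0 {
        return 0; // zero has no discrete log
    }
    // a * b = g^(log a + log b); the doubled exp table absorbs the wrap.
    exp[log[a as usize] as usize + log[b as usize] as usize]
}
```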
## Summary
More disappointing results, but at least I've added some more useful
building blocks.
As a high priority, I'll have to figure out how to emit assembly for
my generic field multiplication code.
I will leave both "good" multiplications where they are for the
moment, even though they are actually worse than the reference code.
When I have a more complete set of tables and I've refactored the code
to move them into `guff::tables`, I can set up more comprehensive
benchmarking of options and update `guff::good` according to the
results.