* Strictly speaking, according to the kernel definition in the challenge, I have implemented cross-correlation rather than convolution. I have assumed that the gradients for the x and y kernels point in the direction of increasing index, i.e. positive gradients, even though convolution definitionally requires the kernel to be reversed before application. Swapping the implementation would be trivial (see the first sketch below).
* Duplication vs. speed of execution?
* Parallel? (one possible approach is sketched below)

Future work:

* Better refactor the code for the x and y directions, probably with Rust macros (see the macro sketch below)? I am not sure about the performance cost of refactoring the logic into functions, or how much inlining/optimisation the Rust compiler can do under the hood in release mode.
* Test out performance and compare benchmarks with:
  - NumPy (which perhaps uses BLAS/LAPACK or some super-optimised old Fortran voodoo?)
  - PyTorch (where computing the convolution on the GPU will be almost instantaneous, with the calculation time dominated by GPU/CPU IO)
  - hand-crafted 2D C++ vectors or Rust `Vec`s, along with some good old for loops (see the baseline sketch below).
* If you were really squeezing for another near factor of 2 in space, you could maybe store the results in a combination of an unsigned magnitude and a compressed bool (sign-bit) structure, since the difference between two bytes spans only 511 possible values and so fits in 9 bits (2**9 = 512), although I am not yet sure how this would be implemented (one possible packing is sketched below).
* I had a really great time learning about Rust, and can immediately apply some of it to my current work, e.g. multithreaded 3D meshing algorithms in WebAssembly, allowing interactive-speed segmentation editing for our radiology team.

RESULTS

A single-threaded optimised (release) build takes about 10 s to compute a 10^5 x 10^5 array on a 2016 MacBook Pro laptop.
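SKETCHES

A minimal sketch of the cross-correlation/convolution distinction, assuming a 1-D slice of data and a small integer kernel (both function names here are hypothetical, not the project's actual API): cross-correlation slides the kernel as-is, while true convolution reverses it first.

```rust
/// Cross-correlation: slide the kernel over the data without reversing it.
fn cross_correlate(data: &[u8], kernel: &[i32]) -> Vec<i32> {
    data.windows(kernel.len())
        .map(|w| {
            w.iter()
                .zip(kernel)
                .map(|(&x, &k)| i32::from(x) * k)
                .sum::<i32>()
        })
        .collect()
}

/// True convolution: identical, except the kernel is reversed first.
fn convolve(data: &[u8], kernel: &[i32]) -> Vec<i32> {
    let reversed: Vec<i32> = kernel.iter().rev().copied().collect();
    cross_correlate(data, &reversed)
}
```

If the kernel is antisymmetric (e.g. a simple [-1, 0, 1] gradient), reversing it merely negates the output, which is why swapping the implementation would be trivial.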
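On the parallelism question: the output rows are independent, so one possible approach (a sketch only, assuming the third-party rayon crate as a new dependency, and a hypothetical function name) is to split the x-gradient work by row across rayon's work-stealing thread pool:

```rust
use rayon::prelude::*; // assumes rayon = "1" in Cargo.toml

// Hypothetical parallel x-gradient over a row-major `height` x `width` image;
// each output row depends on exactly one input row, so no synchronisation
// is needed between tasks.
fn gradient_x_parallel(image: &[u8], width: usize, grad: &mut [i16]) {
    grad.par_chunks_mut(width - 1)
        .zip(image.par_chunks(width))
        .for_each(|(out_row, in_row)| {
            for x in 0..width - 1 {
                out_row[x] = i16::from(in_row[x + 1]) - i16::from(in_row[x]);
            }
        });
}
```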
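On the x/y refactor: a sketch of how the duplication might collapse into a single `macro_rules!` definition, since the two directions differ only in the stride between the two differenced samples. All names are hypothetical; the `$width` identifier is threaded through explicitly because macro hygiene would otherwise stop a call-site `width` token from resolving to a parameter declared inside the macro. As for the inlining worry, rustc/LLVM inline small functions aggressively in release mode, and `#[inline]` can hint where they don't, so plain helper functions would likely also cost nothing.

```rust
// Hypothetical macro stamping out one gradient function per axis; only the
// stride between the two differenced samples changes between x and y.
macro_rules! gradient_fn {
    ($name:ident, $width:ident, $stride:expr) => {
        // Signed difference between an element of a flattened row-major
        // image buffer and its neighbour `stride` positions ahead.
        #[allow(unused_variables)] // `width` is unused in the x variant
        fn $name(data: &[u8], $width: usize, idx: usize) -> i16 {
            i16::from(data[idx + $stride]) - i16::from(data[idx])
        }
    };
}

gradient_fn!(gradient_x, width, 1);     // stride 1: next column
gradient_fn!(gradient_y, width, width); // stride `width`: next row
```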
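For the hand-rolled Rust baseline, something like the following gives a floor to compare the NumPy and PyTorch numbers against (a sketch with made-up dimensions; `std::hint::black_box` keeps the optimiser from deleting the otherwise-unused result):

```rust
use std::hint::black_box;
use std::time::Instant;

fn main() {
    // Made-up size: 10_000 x 10_000 keeps the example quick to run.
    let (width, height) = (10_000usize, 10_000usize);
    let image = vec![0u8; width * height];

    let start = Instant::now();
    // x-gradient: difference between horizontally adjacent pixels.
    let mut grad = vec![0i16; (width - 1) * height];
    for y in 0..height {
        for x in 0..width - 1 {
            let i = y * width + x;
            grad[y * (width - 1) + x] = i16::from(image[i + 1]) - i16::from(image[i]);
        }
    }
    println!("x-gradient computed in {:?}", start.elapsed());
    black_box(grad);
}
```

Any such comparison should use `cargo run --release`; a debug build can easily be an order of magnitude slower, which would skew the comparison against NumPy and PyTorch.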
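One way the 9-bit packing idea could look (purely a sketch; `PackedGradients` and its methods are hypothetical): store each difference's magnitude as a `u8` and pack the sign bits eight to a byte.

```rust
/// Hypothetical 9-bits-per-value storage for byte differences in [-255, 255]:
/// one u8 magnitude per element plus one packed sign bit.
struct PackedGradients {
    magnitudes: Vec<u8>, // |d| for every difference d
    signs: Vec<u8>,      // one bit per element, LSB-first within each byte
}

impl PackedGradients {
    fn from_differences(diffs: &[i16]) -> Self {
        let mut magnitudes = Vec::with_capacity(diffs.len());
        let mut signs = vec![0u8; (diffs.len() + 7) / 8];
        for (i, &d) in diffs.iter().enumerate() {
            magnitudes.push(d.unsigned_abs() as u8); // |d| <= 255 by construction
            if d < 0 {
                signs[i / 8] |= 1 << (i % 8);
            }
        }
        Self { magnitudes, signs }
    }

    fn get(&self, i: usize) -> i16 {
        let mag = i16::from(self.magnitudes[i]);
        if self.signs[i / 8] & (1 << (i % 8)) != 0 { -mag } else { mag }
    }
}
```

That amortises to 9 bits per value versus the 16 of an `i16`, close to the near factor of 2 mentioned above; whether the extra bit-twiddling pays off would depend on whether the workload is memory-bound.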