TODOs:

General:

1) Check for malloc errors.

2) grep FIXME *.c *.h

asm level:

All the SP code is *heavily* dependent on the C compiler. The performance gap
between unoptimized and -O3 code is huge. Maybe a couple of asm routines would
be appropriate if some kind soul were willing to write them.

1) spv_dotproduct. Given spvs a and b, compute 
   a[0] * b[0] + ... a[n-1] * b[n-1] mod p

   Useful in mpzspp_normalize.

2) spv_ntt_gfp_dif, spv_ntt_gfp_dit. Most of stage 2 is spent on these
   functions; maybe a hand-written version will outperform gcc's efforts.


SP level:

1) Maybe use 64->32 bit Montgomery reduction instead of GMP's
   udiv_qrnnd_preinv2norm. Preliminary tests indicate this gives a 10% speedup
   in ntt_gfp_dif. See sp_montmul in sp.h.

   UPDATE: This code has been written (but is not in cvs). It turns out that
           sp_montmul as presented is incorrect as we have to check for
	   overflows. Unfortunately this cancels out the 10% speedup.

SPV level:

1) toom3 and toom4 have been translated directly from toomcook.c but seem to be
   very slow. Try lowering the operation count (i.e. increase the number of
   muls) and see if we can fit a toom3/4 in-between karatsuba and ntt.

4) Try out the NTT over GP(p^2). Old code doing this exists (DN), I'll see if
   I can get it to work at some point. This is why the function is called
   spv_ntt_gfp_dif, not just spv_ntt_dif.

5) Split-radix NTT. Old code doing this exists (DN), etc... I read somewhere
   that "A well-implemented radix-2 NTT can outperform a badly-implemented
   split-radix NTT"

MPZSPP level:

1) Find a better way of doing mpzspp_normalize.

2) Use a division tree for mpzspp_set_mpzp. Note however that these aren't
   too time-critical if we rewrite the high-level poly functions (see below)
   so they don't convert to/from mpzspps for every poly mul.