Last Reviewed: Seth T. 2022-10-19 Quality of Life 1. Verify kernel_info.maxThreadsPerBlock <= TPB_DEFAULT in BIT select loop. 2. Add info on what GPU kernels are available. 3. Copy over `s_bit` to the GPU in chucks of 2^20 to reduce gpu memory alloc 4. Improve Peak memory reporting during GPU runs See https://stackoverflow.com/questions/11631191/why-does-the-cuda-runtime-reserve-80-gib-virtual Big improvements * Automatic tuning for TPB, gpucurves gpu_throughput_test could be automated in the code and run when B1 > THRESHOLD Getting the wrong gpucurves / TPB pair can cost 2x/4x performance. * Flag to sleep between kernel calls This reduces performance but can help the responsiveness of the computer. Even 5ms of pause (~7% performance) made the system seem much more stable. * Some benchmark (gpu_throughput_test.sh?) to prevent performance regressions. Ideally it would record * Consider what changes would be needed to run multiple numbers at a time. `set_p_2p`, `findfactor`, and `process_results` seem easy to change `np0` would have to become an array. `n_log2` becames `max_n_log2`. Testing 1. Improve carry bit testing (see overflow test in check_gpuecm.sage) 2. Ask a C++ person and verify that GPU and CPU both use same endianness This could affect `to_mpz`, `from_mpz`, `allocate_and_set_s_bits`, and `set_p_2p` This is possibly handled by `endian` parameter in mpz_export Several things have been tried to improve performance that didn't turn out. These are recorded so we don't forget. 1. Branchless if(carry) { cgbn_sub(r, r, modulus) } can be replaced with cgbn_sub(r, r, zero_or_modulus[carry]); In theory this should help the different threads all stay alligned, in practice it didn't 2. Reducing number of reserved carry bits CARRY_BITS can be reduced form 6 to 2(?) be checking carry bit in addition to overflow after cgbn_add. This increase the size of number that can be run in cgbn_kernel. There's some performance penenalty for this so the code is uncommitted, but this is a big improvement for numbers 506-510 and 1018-1022 bits. 3. Removing the bn_t variables CB,DA,AA,BB,k,dK Only 2 (or possible 3) temporary variables are needed. This change didn't impact performance and hurt readability so was backed out. It's always possible that it could reduce registers 4. Add fast squaring to CGBN, see https://github.com/NVlabs/CGBN/issues/19