TODO (Mar 2024):

    set/get cuda archictures
    CUDA PreJIT kernels
    GB_cuda_matrix_advise:  write it
    dot3: allow iso
    use a stream pool (from RMM)
    can rmm_wrap be thread safe?
    # of threadblocks in reduce
    reduce calls GB_enumify_reduce twice
    set/get which GPU(s) to use
    data types > 32 bytes
    handling nvcc compiler errors
    static device function for computing ks (acts like GB_ek_slice,
        so call it GB_ek_slice_device

--------------------------------------------------------------------------------

all the FIXMEs

clean up comments and code style

hexadecimal

stream pool

test complex

reduce: do any monoid
    terminal condition?

ANY monoid in mxm

full test suite

when to use the GPU?  which GPU?  See the new GxB_Context object

rmm_init

--------------------------------------------------------------------------------
future:

cuda: needs source directory
    (1) environment var set.  If so, use it.
    (3) not found so no cuda jit

all of GrB_mxm?

(1) DOT
        dot2:       C<#> = A'*B     C is bitmap or full
        dot3:       C<M>=A'B        C is sparse empty, M is sparse/hyper
        dot4:       C += A'B        C is full (+ same as semiring monoid)

(2)     colscale    C = A*D

(3)     rowscale    C = D*B

(4) SAXPY

        saxpy3      C<#> = A*B      C is sparse or hyper

        saxpy4      C += A*B        C is full, A hyper or sparse
                                    B is full or bitmap

        saxpy5      C += A*B        C is full, B hyper or sparse
                                    A is full or bitmap

        bitmap      C<#> = A*B      C bitmap

NO      bitmap      C<#> += A*B     C bitmap
NO      outer       C<#> = AB'      C sparse? full? bitmap?

    GrB_select?