* Add support for TPP pivot_mode
* Handle middle of tree better (adapt small subtree kernel to allow inputs?)
* Optimize small subtree factorization code
* Figure out how to improve root node performance on many cores
* Optimize/parallelize TPP code [or pass straight to parent?]
* Other optimizations around delayed pivots
* Write report on code
* Sort out test deck
* Sort out documentation
* Parallel solve?
* Add note that hwloc needs CUDA support at compile time to work right for us?
* Document rb_write and add C interface