unifrac_rs

Crates.iounifrac_rs
lib.rsunifrac_rs
version0.2.1
created_at2025-07-11 05:49:23.983483+00
updated_at2025-09-10 03:25:44.269312+00
descriptionStriped UniFrac for comparing microbiomes using optimal balanced parenthesis
homepage
repositoryhttps://github.com/jianshu93/unifrac_bp.git
max_upload_size
id1747429
size217,900
Jianshu_Zhao (jianshu93)

documentation

README

UniFrac implememtation in Rust

This repo shows how to compute the UniFrac distance (unweighted, weighted and genralized) between pairs of samples containing taxa. It uses the succint data strucuture (balanced parenthesis) in succparen crate to represent a phylogenetic tree so that then the tree is huge, UniFrac computation can still be fast. SIMD was used to speed up all computations.

Striped UniFrac is the default algorithm and it is extremely fast for large number of samples. In fact, with sparse features of input samples, the complexity is close to O((N/s)^2), where s is average sparsity (average proportion of taxa detected at least once in pairs of samples/all taxa in the tree). An average sparsity of 5% indicates a 0.0025 scale down from O(N^2).

Right now, the performance matches C++ version of Striped UniFrac in unifrac-binaries (https://github.com/biocore/unifrac-binaries) (CPU only). I will stop optimizatizing here because this crate was developed for benchmark DartUniFrac.

Install

git clone https://github.com/jianshu93/unifrac_bp
cd unifrac_bp

cargo build --release
./target/release/unifrac -h

Usage

************** initializing logger *****************

Striped UniFrac via Optimal Balanced Parenthesis

Usage: unifrac [OPTIONS] --tree <tree> <--input <input>|--biom <biom>>

Options:
  -t, --tree <tree>        Input tree in Newick format
  -i, --input <input>      OTU/Feature table in TSV format
  -m, --biom <biom>        OTU/Feature table in BIOM (HDF5) format
      --weighted           Weighted UniFrac (normalized). Per-sample relative abundances will be used
  -T, --threads <threads>  Number of threads, default all logical cores
  -o, --output <output>    Output distance matrix in TSV format [default: unifrac.tsv]
      --generalized        Generalized UniFrac
      --alpha <alpha>      Alpha parameter for Generalized UniFrac (default 0.5). Only used with --generalized. [default: 0.5]
  -h, --help               Print help
  -V, --version            Print version

example

### Then run unifrac-rs like this:
./target/release/unifrac -t data/test.nwk -i data/test_OTU_table.txt  -o try.txt
cat try.txt

### Weighted 
./target/release/unifrac -t data/test.nwk -i data/test_OTU_table.txt --weighted  -o try.txt

### generalized
./target/release/unifrac -t data/test.nwk -i data/test_OTU_table.txt --generalized  -o try.txt

benchmark

#### Striped UniFrac, all logic cores/threads will be used by default
#### for 4204 microbiome samples with a total of ~0.5 million taxa, it took only ~30s on a M4 Max CPU. 
#### also ~30s on a Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz CPU with 32 cores
RUST_LOG=info ./target/release/unifrac -t ./GMTOLsong_table2024_N20_f2all_V4_table.nwk -m ./GMTOLsong_table2024_N20_f2all_V4_filt.biom -o GMTOLsong_dist_rs_biom.tsv
[2025-07-10T05:27:14Z INFO  unifrac_rs] Start parsing input.
[2025-07-10T05:27:14Z INFO  unifrac_rs] phase-1 masks built     40 ms
[2025-07-10T05:27:14Z INFO  unifrac_rs] phase-2 sparse lists built (66 strips)
[2025-07-10T05:27:45Z INFO  unifrac_rs] phase-3 block pass  31135 ms
[2025-07-10T05:27:46Z INFO  unifrac_rs] Start writing output.



### Weighted
$ RUST_LOG=info striped_unifrac -t ./GMTOLsong_table2024_N20_f2all_V4_table.nwk -m ./GMTOLsong_table2024_N20_f2all_V4_filt.biom --weighted -o GMTOLsong_dist_weighted_rs.tsv

 ************** initializing logger *****************

[2025-08-31T06:08:39Z INFO  striped_unifrac] logger initialized from default environment
[2025-08-31T06:08:42Z INFO  striped_unifrac] Total branches with positive length: 1039642
[2025-08-31T06:08:42Z INFO  striped_unifrac] Start parsing input.
[2025-08-31T06:08:53Z INFO  striped_unifrac] phase-1 (weighted)    245 ms
[2025-08-31T06:08:53Z INFO  striped_unifrac] phase-2 (weighted) lists built (62 strips)
[2025-08-31T06:09:24Z INFO  striped_unifrac] phase-3 (weighted)  12467 ms
[2025-08-31T06:09:24Z INFO  striped_unifrac] Start writing output.

For 50,000 samples on a 64-core AMD CPU. The performance matches that of C++ unifrac-binaries.

$unifrac -t ./ag_emp.tre -m ./ag_emp_even500.biom -o emp_50k_dist_rust.tsv

 ************** initializing logger *****************

real	12m17.507s
user	1486m0.383s
sys	3m17.306s


$time ssu -t ./ag_emp.tre -i ag_emp_even500.biom -o emp_50k_dist_c++.tsv -m unweighted

real
13m43.239s
user
108m10.510s
sys
0m50.950s

Weighted

$ RUST_LOG=info unifrac -t ./emp90.5000_1000_rxbl_placement_pruned75.tog.tre -m ./emp.90.min25.deblur.withtax.withtree.even1k.biom --weighted -o emp.90.weighted.simd.tsv

 ************** initializing logger *****************

[2025-09-06T01:24:17Z INFO  striped_unifrac] logger initialized from default environment
[2025-09-06T01:24:22Z INFO  striped_unifrac] Total branches with positive length: 953222
[2025-09-06T01:24:22Z INFO  striped_unifrac] Start parsing input.
[2025-09-06T01:24:29Z INFO  striped_unifrac] parent pointers built in 3 ms
[2025-09-06T01:24:29Z INFO  striped_unifrac] block geometry: blk=64, nblk=393, threads=128
[2025-09-06T01:34:07Z INFO  striped_unifrac] weighted striped pass done in 577646 ms
[2025-09-06T01:34:07Z INFO  striped_unifrac] Start writing output.

$ time unifrac-binaries -t ./emp90.5000_1000_rxbl_placement_pruned75.tog.tre -i ./emp.90.min25.deblur.withtax.withtree.even1k.biom -m weighted_normalized -o emp.90.weighted.C++.tsv

real	9m36.005s
user	788m34.395s
sys	3m11.660s

todo

GPU offloading via cudarc-rs

References

1.Lozupone, C. and Knight, R., 2005. UniFrac: a new phylogenetic method for comparing microbial communities. Applied and environmental microbiology, 71(12), pp.8228-8235.

2.Lozupone, C.A., Hamady, M., Kelley, S.T. and Knight, R., 2007. Quantitative and qualitative β diversity measures lead to different insights into factors that structure microbial communities. Applied and environmental microbiology, 73(5), pp.1576-1585.

3.Hamady, M., Lozupone, C. and Knight, R., 2010. Fast UniFrac: facilitating high-throughput phylogenetic analyses of microbial communities including analysis of pyrosequencing and PhyloChip data. The ISME journal, 4(1), pp.17-27.

4.McDonald, D., Vázquez-Baeza, Y., Koslicki, D., McClelland, J., Reeve, N., Xu, Z., Gonzalez, A. and Knight, R., 2018. Striped UniFrac: enabling microbiome analysis at unprecedented scale. Nature methods, 15(11), pp.847-848.

5.Sfiligoi, I., Armstrong, G., Gonzalez, A., McDonald, D. and Knight, R., 2022. Optimizing UniFrac with OpenACC yields greater than one thousand times speed increase. Msystems, 7(3), pp.e00028-22.

6.Chen, J., Bittinger, K., Charlson, E.S., Hoffmann, C., Lewis, J., Wu, G.D., Collman, R.G., Bushman, F.D. and Li, H., 2012. Associating microbiome composition with environmental covariates using generalized UniFrac distances. Bioinformatics, 28(16), pp.2106-2113.

Commit count: 0

cargo fmt