lash-rs

version: 0.1.2
created_at: 2025-07-23 22:27:12.052147+00
updated_at: 2025-07-23 22:27:12.052147+00
description: Genome/Metagenome sketching via HyperMinHash and UltraLogLog
repository: https://github.com/jianshu93/lash
id: 1765330
size: 100,268
Claire Wang (clairewowo)

Fast and Memory Efficient Genome/Metagenome Sketching via HyperMinHash and UltraLogLog

Genome sketching with MinHash-like algorithms can be extremely accurate but requires a large amount of memory. A more recent algorithm, HyperMinHash (1), combines MinHash and HyperLogLog and performs MinHash in loglog space, a significant reduction in space/memory requirements. Together with lukaslueg, we first created a Rust library, hyperminhash, and then combined rolling hashing with HyperMinHash for extremely fast processing of genomic sequences. xxHash3 is used as the underlying hash function.
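As a rough illustration of the MinHash idea that HyperMinHash builds on, the sketch below keeps the s smallest k-mer hashes of a sequence and estimates Jaccard similarity from two such sketches. It is not HyperMinHash and not lash's implementation: the function names are hypothetical, the standard library hasher stands in for xxHash3, each k-mer is hashed from scratch rather than with a rolling hash, and reverse complements are ignored.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::BTreeSet;
use std::hash::{Hash, Hasher};

/// Hash one k-mer. Assumption: DefaultHasher is only a stand-in for xxHash3.
fn hash_kmer(kmer: &[u8]) -> u64 {
    let mut hasher = DefaultHasher::new();
    kmer.hash(&mut hasher);
    hasher.finish()
}

/// Bottom-s MinHash sketch: the `s` smallest distinct k-mer hashes of `seq`.
fn bottom_sketch(seq: &[u8], k: usize, s: usize) -> BTreeSet<u64> {
    let mut sketch = BTreeSet::new();
    if k == 0 || seq.len() < k {
        return sketch;
    }
    for kmer in seq.windows(k) {
        sketch.insert(hash_kmer(kmer));
        if sketch.len() > s {
            // Evict the largest hash so that only the s smallest remain.
            let largest = *sketch.iter().next_back().unwrap();
            sketch.remove(&largest);
        }
    }
    sketch
}

/// Estimate Jaccard similarity from two bottom-s sketches: among the s
/// smallest hashes of the union, count the fraction present in both.
fn jaccard_estimate(a: &BTreeSet<u64>, b: &BTreeSet<u64>, s: usize) -> f64 {
    // BTreeSet::union yields values in ascending order.
    let smallest: Vec<u64> = a.union(b).copied().take(s).collect();
    if smallest.is_empty() {
        return 0.0;
    }
    let shared = smallest
        .iter()
        .filter(|&&h| a.contains(&h) && b.contains(&h))
        .count();
    shared as f64 / smallest.len() as f64
}
```

A bottom-s sketch stores s full 64-bit hashes per genome; HyperMinHash reduces this cost by storing compressed registers instead.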

More recently, an algorithm named UltraLogLog was introduced (2). It is similar to HyperLogLog but up to 28% more space-efficient, and it comes with a fast estimator. UltraLogLog sketches also compact better under standard compression algorithms. UltraLogLog was implemented with waynexia, see ultraloglog. Both HyperMinHash and UltraLogLog are available as options in our tool.
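For readers unfamiliar with this family of sketches, here is a minimal sketch of the classic HyperLogLog register update and merge, which UltraLogLog refines. It is plain HyperLogLog, not UltraLogLog and not the ultraloglog crate's API: the type and method names are hypothetical, and the cardinality estimator is left out.

```rust
/// HyperLogLog-style registers (illustrative only; UltraLogLog packs
/// additional information into each 8-bit register).
struct HllRegisters {
    p: u32,        // precision: the sketch has 2^p registers
    regs: Vec<u8>, // one small counter per register
}

impl HllRegisters {
    fn new(p: u32) -> Self {
        assert!((4..=18).contains(&p), "precision out of range");
        Self { p, regs: vec![0; 1usize << p] }
    }

    /// Update the sketch with an already-computed 64-bit hash.
    fn insert_hash(&mut self, hash: u64) {
        // The top p bits select the register.
        let idx = (hash >> (64 - self.p)) as usize;
        // The remaining bits give the rank: leading zeros + 1, capped at 64 - p + 1.
        let rest = hash << self.p;
        let rank = rest.leading_zeros().min(64 - self.p) as u8 + 1;
        if rank > self.regs[idx] {
            self.regs[idx] = rank;
        }
    }

    /// Merging two sketches of equal precision is an element-wise maximum.
    fn merge(&mut self, other: &HllRegisters) {
        assert_eq!(self.p, other.p, "precisions must match");
        for (a, b) in self.regs.iter_mut().zip(&other.regs) {
            *a = (*a).max(*b);
        }
    }
}
```

Because each register only ever keeps a maximum, sketches built on different threads or files can be merged without revisiting the input.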

We use a simple producer-consumer model to reduce memory requirements for large files, e.g., metagenomic files. Both sketching and distance computation are parallelized to make full use of all CPU threads/cores.
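The memory benefit of a producer-consumer design comes from bounding how many records are in flight at once. A minimal sketch using only the standard library is shown below; the function name and the byte-counting "work" are placeholders, and lash's actual pipeline may be organized differently.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

/// Stream `records` through a bounded queue and return the total number of
/// bases seen. The counting stands in for real work such as k-mer hashing.
fn process_records(records: Vec<String>, capacity: usize) -> usize {
    // A bounded channel: the producer blocks once `capacity` records are queued,
    // so the queue never holds more than that many records at a time.
    let (tx, rx) = sync_channel::<String>(capacity);

    let producer = thread::spawn(move || {
        for record in records {
            if tx.send(record).is_err() {
                break; // the consumer has gone away
            }
        }
        // `tx` is dropped here, which closes the channel.
    });

    let mut total = 0;
    // The loop ends when the channel is closed and drained.
    for record in rx {
        total += record.len();
    }

    producer.join().expect("producer thread panicked");
    total
}
```

In a real reader the producer would parse sequences from disk one at a time instead of receiving a pre-built vector, which is where the memory saving comes from.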

There are two main subcommands, sketch and dist. sketch is the sketching command and outputs 3 files: one containing the sketches of the genomes (zstd compressed), one listing the genome files used, and one recording the parameters used for the command. dist reads the sketch files for the query and reference prefixes specified by the user and outputs a file containing the distances between the query and reference genomes. More details on these commands are under "Usage".

We hope that you find this tool helpful in your scientific endeavors!

Quick install

### pre-compiled binary for Linux
wget https://github.com/jianshu93/lash/releases/download/v0.1.2/lash_Linux_x86-64_v0.1.2.zip
unzip lash_Linux_x86-64_v0.1.2.zip
chmod a+x ./lash
./lash -h

### install with cargo (install Rust first via https://rustup.rs; cargo is installed by default)
cargo install lash-rs

### compiling from source
git clone https://github.com/jianshu93/lash
cd lash
cargo build --release
./target/release/lash -h

Usage


 ************** initializing logger *****************

Fast and Memory Efficient Genome Sketching via HyperMinHash and UltraLogLog

lash sketch --file <file> --output <output_prefix> --kmer <kmer_length> --threads <num_threads> --algorithm <algorithm> --precision <precision_ull>

Options:
  -f, --file <file>                 File containing list of FASTA files
  -o, --output <output_prefix>      Prefix you would like your output file names to start with
  -k, --kmer <kmer_length>          Length of k-mers
  -t, --threads <num_threads>       Number of threads you would like to use. Defaults to the number of cores on your device
  -a, --algorithm <algorithm>       Algorithm of choice. Either hmh for HyperMinHash, or ull for UltraLogLog
  -p, --precision <precision_ull>   Precision to use, only for UltraLogLog. Defaults to 10.
  -v, --version                     Print version


lash dist --query <query_prefix> --reference <ref_prefix> --output <output_prefix> --threads <num_threads> --estimator <estimator>
Options:
  -q, --query <query_prefix>        Prefix to search for your query genome files. Should match what you put as "output" from sketch. 
  -r, --reference <ref_prefix>      Prefix to search for your reference genome files. Should match what you put as "output" from sketch. 
  -o, --output <output_prefix>      Prefix you would like your output file names to start with
  -t, --threads <num_threads>       Number of threads you would like to use. Defaults to the number of cores on your device
  -e, --estimator <estimator>       Estimator to use, only for UltraLogLog sketches. Either "fgra" for the further generalized remaining area estimator or "ml" for the maximum likelihood estimator. Defaults to "ml".
  -v, --version                     Print version


ls ./data/*.fasta > query_list_strep.txt
ls ./data/*.fasta > ref_list_strep.txt
lash sketch --file ./query_list_strep.txt -k 16 -o skh_query
lash sketch --file ./ref_list_strep.txt -k 16 -o skh_ref
lash dist -q ./skh_query -r ./skh_ref -t 8 -o dist

Output

The output format is the same as Mash's: the first column is the query name, the second column is the reference name, and the third column is the Mash distance.
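For reference, the Mash distance is conventionally derived from a Jaccard estimate j and the k-mer length k as D = -(1/k) ln(2j / (1 + j)). The helper below is a minimal sketch of that standard formula; the function name is hypothetical, and whether lash applies exactly the same clamping at the extremes is an assumption.

```rust
/// Mash distance from a Jaccard estimate `jaccard` and k-mer length `k`,
/// using the standard formula D = -(1/k) * ln(2j / (1 + j)).
/// Assumption: values are clamped to [0, 1], with j = 0 mapped to distance 1.
fn mash_distance(jaccard: f64, k: usize) -> f64 {
    if jaccard <= 0.0 {
        return 1.0; // no shared k-mers: maximum distance
    }
    if jaccard >= 1.0 {
        return 0.0; // identical k-mer sets
    }
    let d = -(2.0 * jaccard / (1.0 + jaccard)).ln() / k as f64;
    d.clamp(0.0, 1.0)
}
```

With k = 16, as in the example commands above, a Jaccard estimate of 1/3 gives 2j / (1 + j) = 0.5 and therefore a distance of ln(2) / 16.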

References

  1. Yu YW, Weber GM. Hyperminhash: Minhash in loglog space. IEEE Transactions on Knowledge and Data Engineering. 2020 Mar 17;34(1):328-39.
  2. Ertl O. UltraLogLog: A Practical and More Space-Efficient Alternative to HyperLogLog for Approximate Distinct Counting. Proceedings of the VLDB Endowment. 2024 Mar 1;17(7):1655-68.