| Crates.io | lash-rs |
| lib.rs | lash-rs |
| version | 0.1.2 |
| created_at | 2025-07-23 22:27:12.052147+00 |
| updated_at | 2025-07-23 22:27:12.052147+00 |
| description | Genome/Metagenome sketching via HyperMinHash and UltraLogLog |
| homepage | |
| repository | https://github.com/jianshu93/lash |
| max_upload_size | |
| id | 1765330 |
| size | 100,268 |
Genome sketching with MinHash-like algorithms can be extremely accurate but requires a large amount of memory. Recently, a new algorithm combining MinHash and HyperLogLog, called HyperMinHash, was invented (1). It performs MinHash in loglog space, a significant reduction in memory requirements. Together with lukaslueg, we first created a Rust library, hyperminhash, and then combined rolling hashing with HyperMinHash for extremely fast processing of genomic sequences. xxHash3 is used as the underlying hash function.
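To make the baseline concrete, here is a minimal sketch of classic bottom-s MinHash over k-mers, the technique whose memory footprint HyperMinHash reduces. It is not the code used by lash or hyperminhash. It assumes the standard library's `DefaultHasher` in place of xxHash3, re-hashes every k-mer instead of using a rolling hash, ignores reverse complements, and all function names are hypothetical.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::BTreeSet;
use std::hash::{Hash, Hasher};

/// Hash one k-mer. DefaultHasher is a stand-in for xxHash3 (assumption).
fn hash_kmer(kmer: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    kmer.hash(&mut h);
    h.finish()
}

/// Keep the `s` smallest distinct k-mer hashes of a sequence (k must be > 0).
fn bottom_sketch(seq: &[u8], k: usize, s: usize) -> BTreeSet<u64> {
    let mut sketch = BTreeSet::new();
    for kmer in seq.windows(k) {
        sketch.insert(hash_kmer(kmer));
        if sketch.len() > s {
            // Drop the largest hash so only the s smallest remain.
            let largest = *sketch.iter().next_back().unwrap();
            sketch.remove(&largest);
        }
    }
    sketch
}

/// Estimate Jaccard similarity from two bottom-s sketches.
fn jaccard(a: &BTreeSet<u64>, b: &BTreeSet<u64>, s: usize) -> f64 {
    // BTreeSet::union iterates in ascending order, so the first s values
    // form the bottom-s sketch of the union of both k-mer sets.
    let union: Vec<u64> = a.union(b).copied().take(s).collect();
    if union.is_empty() {
        return 0.0;
    }
    let shared = union
        .iter()
        .filter(|h| a.contains(*h) && b.contains(*h))
        .count();
    shared as f64 / union.len() as f64
}
```

Each sketch here stores `s` full 64-bit hashes per genome. That per-genome cost is what a loglog-space representation is meant to shrink.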
More recently, an algorithm named UltraLogLog was invented (2). It is similar to HyperLogLog but up to 28% more space-efficient due to a faster estimator. UltraLogLog sketches also compress better with standard compression algorithms. UltraLogLog was implemented together with waynexia; see ultraloglog. Both HyperMinHash and UltraLogLog are available as options in our tool.
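For orientation, the following is a minimal sketch of plain HyperLogLog, the structure that UltraLogLog refines. It is not the ultraloglog crate's API and does not implement UltraLogLog itself. The `Hll` type, its methods, and the use of `DefaultHasher` are assumptions made for illustration.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Plain HyperLogLog with 2^p one-byte registers (illustration only).
/// `p` is assumed to be between 7 and 16.
struct Hll {
    p: u32,
    registers: Vec<u8>,
}

impl Hll {
    fn new(p: u32) -> Self {
        Hll { p, registers: vec![0; 1 << p] }
    }

    fn insert<T: Hash>(&mut self, item: &T) {
        let mut h = DefaultHasher::new();
        item.hash(&mut h);
        let x = h.finish();
        // The top p bits select a register.
        let idx = (x >> (64 - self.p)) as usize;
        // Rank = position of the first 1-bit among the remaining 64 - p bits.
        let rest = x << self.p;
        let rank = (rest.leading_zeros().min(64 - self.p) + 1) as u8;
        if rank > self.registers[idx] {
            self.registers[idx] = rank;
        }
    }

    fn estimate(&self) -> f64 {
        let m = self.registers.len() as f64;
        let alpha = 0.7213 / (1.0 + 1.079 / m); // bias constant for m >= 128
        let sum: f64 = self.registers.iter().map(|&r| 2f64.powi(-(r as i32))).sum();
        let raw = alpha * m * m / sum;
        let zeros = self.registers.iter().filter(|&&r| r == 0).count();
        if raw <= 2.5 * m && zeros > 0 {
            // Small-range correction: fall back to linear counting.
            m * (m / zeros as f64).ln()
        } else {
            raw
        }
    }
}
```

With `p = 12` this uses 4096 bytes regardless of how many distinct items are inserted, and the expected relative error is roughly 1.04 / sqrt(4096), about 1.6%.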
We also use a simple producer-consumer model to reduce memory requirements for large files, e.g., metagenomic files. Both sketching and distance computation are parallelized to make full use of all CPU threads/cores.
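The producer-consumer pattern can be sketched with a bounded channel from the Rust standard library. This is only an illustration under stated assumptions, not lash's actual code. The function name `count_bases_streaming`, the channel capacity, and the per-record work (counting bases) are hypothetical stand-ins for reading sequences and updating a sketch.

```rust
use std::sync::mpsc::sync_channel;
use std::sync::{Arc, Mutex};
use std::thread;

/// Stream records through a bounded queue so that at most `capacity`
/// records wait in the queue at once, however large the input is.
fn count_bases_streaming(records: Vec<Vec<u8>>, workers: usize, capacity: usize) -> usize {
    let workers = workers.max(1); // at least one consumer, or the producer would stall
    let (tx, rx) = sync_channel::<Vec<u8>>(capacity);
    // std's Receiver is single-consumer, so the workers share it behind a mutex.
    let rx = Arc::new(Mutex::new(rx));

    // Producer: a real tool would read sequences from disk here.
    let producer = thread::spawn(move || {
        for rec in records {
            // Blocks while the queue is full, which bounds memory use.
            tx.send(rec).unwrap();
        }
        // `tx` is dropped here, which closes the channel.
    });

    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let rx = Arc::clone(&rx);
            thread::spawn(move || {
                let mut total = 0usize;
                loop {
                    // The lock is released at the end of this statement.
                    let msg = rx.lock().unwrap().recv();
                    match msg {
                        Ok(rec) => total += rec.len(), // stand-in for sketching
                        Err(_) => break,               // channel closed and drained
                    }
                }
                total
            })
        })
        .collect();

    producer.join().unwrap();
    handles.into_iter().map(|h| h.join().unwrap()).sum()
}
```

The bounded `sync_channel` is the design choice that matters: the reader can never run far ahead of the workers, so peak memory depends on the queue capacity rather than on file size.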
There are two main subcommands, sketch and dist. sketch is the sketching command and outputs three files: one containing the sketches of the genomes (zstd-compressed), one listing the genome files used, and one recording the parameters used for the command. dist reads the sketch files and outputs a file containing the distances between the query and reference genomes specified by the user. More details on these commands are under "Usage".
We hope that you find this tool helpful in your scientific endeavors!
### pre-compiled binary for Linux
wget https://github.com/jianshu93/lash/releases/download/v0.1.2/lash_Linux_x86-64_v0.1.2.zip
unzip lash_Linux_x86-64_v0.1.2.zip
chmod a+x ./lash
./lash -h
### Install with cargo. Install Rust first from https://rustup.rs (cargo is installed by default)
cargo install lash-rs
### compiling from source
git clone https://github.com/jianshu93/lash
cd lash
cargo build --release
./target/release/lash -h
************** initializing logger *****************
Fast and Memory Efficient Genome Sketching via HyperMinHash and UltraLogLog
lash sketch --file <file> --output <output_prefix> --kmer <kmer_length> --threads <num_threads> --algorithm <algorithm> --precision <precision_ull>
Options:
-f, --file <file> File containing list of FASTA files
-o, --output <output_prefix> Prefix you would like your output file names to start with
-k, --kmer <kmer_length> Length of k-mers
-t, --threads <num_threads> Number of threads you would like to use. Default to the number of cores on your device
-a, --algorithm <algorithm> Algorithm of choice. Either hmh for hyperminhash, or ull for ultraloglog
-p, --precision <precision_ull> Precision to use, only for Ultraloglog. Default to 10.
-v, --version Print version
lash dist --query <query_prefix> --reference <ref_prefix> --output <output_prefix> --threads <num_threads> --estimator <estimator_ull>
Options:
-q, --query <query_prefix> Prefix to search for your query genome files. Should match what you put as "output" from sketch.
-r, --reference <ref_prefix> Prefix to search for your reference genome files. Should match what you put as "output" from sketch.
-o, --output <output_prefix> Prefix you would like your output file names to start with
-t, --threads <num_threads> Number of threads you would like to use. Default to the number of cores on your device
-e, --estimator <estimator> Estimator to use, only for Ultraloglog sketches. Either "fgra" for Fast Graph-based Rank Aggregation or "ml" for maximum likelihood estimator, default to "ml".
-v, --version Print version
ls ./data/*.fasta > query_list_strep.txt
ls ./data/*.fasta > ref_list_strep.txt
lash sketch --file ./query_list_strep.txt -k 16 -o skh
lash dist -q ./skh -r ./skh -t 8 -o dist
The output format is the same as Mash's: the first column is the query name, the second column is the reference name, and the third column is the Mash distance.
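For readers unfamiliar with the third column, Mash distance is derived from the Jaccard similarity j of two k-mer sets and the k-mer length k as D = -(1/k) ln(2j / (1 + j)). The function below is a minimal sketch of that formula. The name `mash_distance` is hypothetical, and how lash itself handles edge cases such as j = 0 is an assumption here (the sketch clamps to the range 0 to 1).

```rust
/// Mash distance from Jaccard similarity `j` and k-mer length `k` (k > 0):
/// D = -(1/k) * ln(2j / (1 + j)), clamped to [0, 1].
fn mash_distance(j: f64, k: usize) -> f64 {
    if j <= 0.0 {
        return 1.0; // no shared k-mers: treat as maximum distance (assumption)
    }
    // -ln(2j / (1 + j)) is written as ln((1 + j) / (2j)) so that j = 1
    // yields +0.0 rather than -0.0.
    let d = ((1.0 + j) / (2.0 * j)).ln() / k as f64;
    d.clamp(0.0, 1.0)
}
```

For example, with k = 16 and j = 0.5 the distance is ln(1.5) / 16, roughly 0.0253. Identical sketches (j = 1) give a distance of 0.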