Crates.io | skani |
lib.rs | skani |
version | 0.1.1 |
source | src |
created_at | 2023-02-07 23:17:02.439886 |
updated_at | 2023-04-09 19:27:59.66477 |
description | skani is a fast tool for calculating ANI between metagenomic sequences, such as metagenome-assembled genomes (MAGs). It is extremely fast and is robust against incompleteness and fragmentation, giving accurate ANI estimates. |
homepage | https://github.com/bluenote-1577/skani |
repository | https://github.com/bluenote-1577/skani |
max_upload_size | |
id | 779185 |
size | 15,749,832 |
skani is a program for calculating average nucleotide identity (ANI) from DNA sequences (contigs/MAGs/genomes) for ANI > ~80%.
skani uses an approximate mapping method without base-level alignment to get ANI. It is magnitudes faster than BLAST based methods and almost as accurate. skani offers:
Accurate ANI calculations for MAGs. skani is accurate for incomplete and medium-quality metagenome-assembled genomes (MAGs). Sketching methods (e.g. Mash), which may underestimate ANI for incomplete MAGs.
Aligned fraction results. skani outputs the fraction of genome aligned, whereas sketching methods do not.
Fast computations. Indexing/sketching is ~ 3x faster than Mash, and querying is about 25x faster than FastANI (but slower than Mash).
Efficient database search. Querying a genome against a preprocessed database of >65000 prokaryotic genomes takes a few seconds with a single processor and ~6 GB of RAM. Constructing a database from genome sequences takes a few minutes to an hour.
Requirements:
Building takes a few minutes (depending on # of cores).
git clone https://github.com/bluenote-1577/skani
cd skani
# If default rust install directory is ~/.cargo
cargo install --path . --root ~/.cargo
skani dist refs/e.coli-EC590.fasta refs/e.coli-K12.fasta
# If ~/.cargo doesn't exist use below commands instead
#cargo build --release
#./target/release/skani dist refs/e.coli-EC590.fasta refs/e.coli-K12.fasta
We offer a pre-built statically compiled executable for x86-64 linux systems. That is, if you're on a x86-64 linux system, you can just download the binary and run it without installing anything.
For using the latest version of skani:
wget https://github.com/bluenote-1577/skani/releases/download/latest/skani
chmod +x skani
./skani -h
Note: the binary is compiled with a different set of libraries (musl instead of glibc), possibly impacting performance (slightly). Probably not a huge deal.
See the Releases page for obtaining specific versions of skani.
conda install -c bioconda skani
Note: skani is being developed quickly and the conda version may be outdated; see the version status above.
# compare two genomes for ANI.
# all options take -t for multi-threading.
skani dist genome1.fa genome2.fa -t 5
# compare multiple genomes
skani dist -q query1.fa query2.fa -r reference1.fa reference2.fa -o all-to-all_results.txt
# construct database and do memory-efficient search
skani sketch genomes_to_search/* -o database
skani search query1.fa query2.fa ... -d database
# use sketch from "skani sketch" output as drop-in replacement
skani dist database/query.fa.sketch database/ref.fa.sketch
# construct distance matrix for all genomes in folder
skani triangle genome_folder/* > skani_ani_matrix.txt
# we provide a script in this repository for clustering/visualizing distance matrices.
# requires python3, seaborn, scipy/numpy, and matplotlib.
python scripts/clustermap_triangle.py skani_ani_matrix.txt
For more information about using the specific skani subcommands, see the guide linked above.
See the advanced usage guide linked above for more information about topics such as:
If the resulting aligned fraction for the two genomes is < 15%, no output is given.
In practice, this means that only results with > ~82% ANI are reliably output (with default parameters). See the skani advanced usage guide for information on how to compare lower ANI genomes.
The default output for search
and dist
looks like
Ref_file Query_file ANI Align_fraction_ref Align_fraction_query Ref_name Query_name
refs/e.coli-EC590.fasta refs/e.coli-K12.fasta 99.39 93.95 93.37 NZ_CP016182.2 Escherichia coli strain EC590 chromosome, complete genome NC_007779.1 Escherichia coli str. K-12 substr. W3110, complete sequence
Jim Shaw and Yun William Yu. Fast and robust metagenomic sequence comparison through sparse chaining with skani. bioRxiv (2023). https://doi.org/10.1101/2023.01.18.524587. Submitted.
We added new experiments on the revised version of our preprint (Extended Data Figs 11-14). We show skani has quite good AF correlation with MUMmer, and that it works decently on simple eukaryotic MAGs, especially with the --slow
option (see below).
--no-learned-ani
.--fast
and --slow
mode added - alias for -c 200
and -c 30
respectively.skani is actively being developed by me (Jim Shaw). I'm more than happy to accommodate simple feature requests (different types of outputs, etc). Feel free to open an issue with your feature request on the github repository. If you catch any bugs, please open an issue or e-mail me (e-mail on my website).
If you're interested in using skani as a rust library, check out the minimal example here: https://github.com/bluenote-1577/skani-lib-example. The documentation is currently minimal (https://docs.rs/skani/0.1.0/skani/) and I guarantee no API stability.
If you're interested in calling skani from python, see the pyskani python interface and bindings to skani written by Martin Larralde. Note: I am not personally involved in the pyskani project and do not offer guarantees on correctness of the outputs.