sketchy-rs

Crates.io	sketchy-rs
lib.rs	sketchy-rs
version	0.6.0
source	src
created_at	2020-02-26 23:54:33.254547+00
updated_at	2022-08-26 08:07:22.493058+00
description	Rust command line client for Sketchy
homepage	https://github.com/esteinig/sketchy
repository	https://github.com/esteinig/sketchy
id	212866
size	52,056

Eike Steinig (esteinig)

documentation

https://github.com/esteinig/sketchy

README

sketchy

Genomic neighbor typing for lineage and genotype inference

Overview

v0.6.0

Sketchy is a lineage-calling and genotyping tool based on the heuristic principle of genomic neighbor typing developed by Karel Břinda and colleagues (2020). It queries species-wide ('hypothesis-agnostic') reference sketches using MinHash and infers associated genotypes from the closest match, including multi-locus sequence types, susceptibility profiles, virulence factors or other genome-associated features provided by the user. Unlike the original implementation in RASE, sketchy does not use phylogenetic trees, which has some downsides, e.g. for sublineage genotype predictions (see below).
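The idea of MinHash-based neighbor typing can be illustrated with a minimal sketch. This is not sketchy's implementation: the hash function (`DefaultHasher`), the function names, the `Reference` struct, the genotype labels and the toy sequences are all assumptions made for illustration, and the toy values of k and s are far smaller than the k = 16 and s = 1000 used for the published reference sketches. It builds a bottom-s sketch from the k-mers of each sequence, counts shared hashes between a query and each reference, and reports the genotype of the reference with the most shared hashes.

```rust
use std::cmp::Ordering;
use std::collections::hash_map::DefaultHasher;
use std::collections::BTreeSet;
use std::hash::{Hash, Hasher};

/// Hash one k-mer to 64 bits. DefaultHasher is a stand-in chosen for
/// illustration; it is an assumption, not the hash sketchy uses.
fn hash_kmer(kmer: &[u8]) -> u64 {
    let mut hasher = DefaultHasher::new();
    kmer.hash(&mut hasher);
    hasher.finish()
}

/// Bottom-s MinHash sketch: the `s` smallest distinct k-mer hashes, sorted ascending.
fn sketch(seq: &[u8], k: usize, s: usize) -> Vec<u64> {
    if k == 0 || seq.len() < k {
        return Vec::new();
    }
    let hashes: BTreeSet<u64> = seq.windows(k).map(hash_kmer).collect();
    hashes.into_iter().take(s).collect()
}

/// Count hashes present in both sketches (both must be sorted ascending).
fn shared_hashes(a: &[u64], b: &[u64]) -> usize {
    let (mut i, mut j, mut shared) = (0, 0, 0);
    while i < a.len() && j < b.len() {
        match a[i].cmp(&b[j]) {
            Ordering::Less => i += 1,
            Ordering::Greater => j += 1,
            Ordering::Equal => {
                shared += 1;
                i += 1;
                j += 1;
            }
        }
    }
    shared
}

/// Hypothetical reference entry: a genotype label and the genome's sketch.
struct Reference {
    genotype: String,
    sketch: Vec<u64>,
}

/// Genotype of the reference sharing the most hashes with the query,
/// or None if no reference shares any hash.
fn closest_genotype<'a>(query: &[u64], refs: &'a [Reference]) -> Option<&'a str> {
    refs.iter()
        .map(|r| (shared_hashes(query, &r.sketch), r))
        .filter(|(shared, _)| *shared > 0)
        .max_by_key(|(shared, _)| *shared)
        .map(|(_, r)| r.genotype.as_str())
}

fn main() {
    // Toy parameters and sequences, for illustration only.
    let (k, s) = (4, 1000);
    let refs = vec![
        Reference {
            genotype: "lineage_A".to_string(),
            sketch: sketch(b"ACGTTGCAAGCTTAGGCTAACCGT", k, s),
        },
        Reference {
            genotype: "lineage_B".to_string(),
            sketch: sketch(b"TTTTGGGGCCCCAAAATTGGCCAA", k, s),
        },
    ];
    // A short "read" taken from the first reference sequence.
    let query = sketch(b"TGCAAGCTTAGG", k, s);
    println!("{:?}", closest_genotype(&query, &refs));
}
```

A real tool would also need to handle reverse complements, sequencing errors and reads arriving as a stream; none of that is modelled here.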

See the latest docs for installation, usage and database building.

Strengths and limitations

Reference sketches and genotype indices can be constructed easily from large genome and genotype collections
Sketchy requires few resources when using small sketch sizes (s = 1000)
Sketchy performs best on lineage predictions and lineage-wide genotypes from very few reads: we found that tens to hundreds of reads can often give a good idea of the close matches in the reference sketch (especially when inspecting the top matches using --top)

However:

Clade-specific genotype resolution is not as good as when using phylogenetic guide trees (RASE)
Sketch size can be increased to improve prediction performance (s = 10000), but resource use scales approximately linearly with sketch size
Sketchy genotype inference may be difficult for species with high rates of homologous recombination
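Inspecting several top matches rather than only the single closest one, as noted in the strengths above, can be sketched as follows. This is an assumption-laden illustration, not a description of what `--top` reports: the `Match` struct, the function names and the scores are hypothetical. It ranks precomputed matches by shared hashes, keeps the best few, and counts how many of them agree on each genotype label.

```rust
use std::collections::HashMap;

/// Hypothetical match record: a reference genotype label and the number
/// of hashes it shares with the query sketch.
#[derive(Debug, Clone, PartialEq)]
struct Match {
    genotype: String,
    shared: usize,
}

/// Keep the `top` matches with the most shared hashes.
/// The sort is stable, so ties keep their input order.
fn top_matches(mut matches: Vec<Match>, top: usize) -> Vec<Match> {
    matches.sort_by(|a, b| b.shared.cmp(&a.shared));
    matches.truncate(top);
    matches
}

/// Count how many of the given matches carry each genotype label.
fn genotype_support(matches: &[Match]) -> HashMap<&str, usize> {
    let mut counts = HashMap::new();
    for m in matches {
        *counts.entry(m.genotype.as_str()).or_insert(0) += 1;
    }
    counts
}

fn main() {
    // Made-up scores for illustration only.
    let matches = vec![
        Match { genotype: "lineage_A".to_string(), shared: 39 },
        Match { genotype: "lineage_C".to_string(), shared: 5 },
        Match { genotype: "lineage_A".to_string(), shared: 42 },
        Match { genotype: "lineage_B".to_string(), shared: 40 },
    ];
    let best = top_matches(matches, 3);
    for m in &best {
        println!("{}\t{}", m.genotype, m.shared);
    }
    println!("{:?}", genotype_support(&best));
}
```

When most of the top matches agree on a label, the call is better supported than when the top matches are split across lineages.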

Data availability

Reference sketches and genotype files (s = 1000, s = 10000, k = 16) for S. aureus (full genotypes including susceptibility predictions and other genotypes), S. pneumoniae, K. pneumoniae, P. aeruginosa and Neisseria spp. (MLST) can be found in the data repository.
Reference sketches for cross-validation on the simulated species data can be found in this data repository; genome assemblies for all species extracted from the ENA reference collection are available in this data repository.
Scripts to extract data from the ENA collections (Grace Blackwell et al.) and to compute reference metrics can be found in the scripts directory.
Nanopore reads for the outbreak isolates and genotype surveillance panels in Papua New Guinea (Flongle, Goroka, sequential protocol) are available for download in the data repository. Raw sequence data (Illumina / ONT) is being uploaded to NCBI (PRJNA657380).

Preprint

If you use sketchy for research and other applications, please cite:

Steinig et al. (2022) - Genomic neighbor typing for bacterial outbreak surveillance - bioRxiv 2022.02.05.479210; doi: https://doi.org/10.1101/2022.02.05.479210

Commit count: 1706