# sketchy ![](https://img.shields.io/badge/lang-rust-black.svg) ![](https://img.shields.io/badge/version-0.6.0-green.svg) ![](https://img.shields.io/badge/preprint-0.12.0-green.svg) Genomic neighbor typing for lineage and genotype inference ## Overview **`v0.6.0`** `Sketchy` is a lineage calling and genotyping tool based on the heuristic principle of genomic neighbor typing developed by [Karel Břinda and colleagues (2020)](https://www.biorxiv.org/content/10.1101/403204v2). It queries species-wide ('hypothesis-agnostic') reference sketches using MinHash and infers associated genotypes based on the closest match, including multi-locus sequence types, susceptibility profiles, virulence factors or other genome-associated features provided by the user. Unlike the original implementation in [`RASE`](https://github.com/c2-d2/rase-pipeline), `sketchy` does not use phylogenetic trees which has some downsides, e.g. for sublineage genotype predictions (see below). See the [latest docs](https://esteinig.github.io/sketchy) for install, usage and database building. ## Strengths and limitations * Reference sketches and genotype indices can be constructed easily from large genome and genotype collections * `Sketchy` requires few resources when using small sketch sizes (`s = 1000`) * `Sketchy` performs best on lineage predictions and lineage-wide genotypes from very few reads - we found that tens to hundreds of reads can often give a good idea of the close matches in the reference sketch (especially when inspecting the top matches using `--top`) However: * Clade-specific genotype resolution is not as good as when using phylogenetic guide trees (`RASE`) * Sketch size can be increased to increase performance (`s = 10000`), but resources scale approximately linearly * `Sketchy` genotype inference may be difficult for species with high rates of homologous recombination ## Data availability * Reference sketches and genotype files (`s = 1000`, `s = 10000`, `k = 16`) for [*S. aureus*](https://cloudstor.aarnet.edu.au/plus/s/3EBgvXi6sVHW8Ne) (full genotypes including susceptibility predictions and other genotypes), *S. pneumoniae*, *K. pneumoniae*, *P. aeruginosa* and *Neisseria spp.* (MLST) can be found in the [data repository](https://cloudstor.aarnet.edu.au/plus/s/rL0RHYunqhRK3i1). * Reference sketches for cross-validation on the simulated species data can be found in this [data repository](https://cloudstor.aarnet.edu.au/plus/s/7ICPoSru6s6EHNY); genome assemblies for all species extracted from the ENA reference collection are available in this [data repository](https://cloudstor.aarnet.edu.au/plus/s/Td3ahBCPP2YAhCU) * Scripts to extract data from the ENA collections [Grace Blackwell et al.](https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.3001421) and compute reference metrics can be found in the [scripts directory](scripts/). * Nanopore reads for the outbreak isolates and genotype surveillance panels in Papua New Guinea (Flongle, Goroka, sequential protocol) are available for download in the [data repository](https://cloudstor.aarnet.edu.au/plus/s/MFkirfq1N6uIosc). Raw sequence data (Illumina / ONT) is being uploaded to NCBI (PRJNA657380). ## Preprint If you use `sketchy` for research and other applications, please cite: > Steinig et al. (2022) - Genomic neighbor typing for bacterial outbreak surveillance - bioRxiv 2022.02.05.479210; doi: https://doi.org/10.1101/2022.02.05.479210