Crates.io | yacrd |
lib.rs | yacrd |
version | 0.6.2 |
source | src |
created_at | 2018-07-19 00:13:44.553519 |
updated_at | 2020-07-18 19:19:20.394854 |
description | Using all-against-all read mapping, yacrd performs: computation of pile-up coverage for each read and detection of chimeras |
homepage | https://github.com/natir/yacrd |
repository | https://github.com/natir/yacrd |
id | 74970 |
size | 153,712 |
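Using all-against-all read mapping, yacrd performs:
1. computation of pile-up coverage for each read
2. detection of chimeras
Chimera detection is done as follows:
1. For each region where coverage is lower than or equal to min_coverage (default 0), yacrd creates a bad region.
2. If a bad region starts after the beginning of the read and ends before its end, the read is marked as Chimeric.
3. If bad regions make up too large a share of the read length, the read is marked as NotCovered.
4. A read that is neither Chimeric nor NotCovered is NotBad.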
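The classification rules can be sketched in a few lines of Python. This is only an illustration, not yacrd's actual Rust implementation: the function names, the half-open interval convention, the 0.8 threshold, and the choice to test NotCovered before Chimeric are all assumptions made for the example.

```python
from typing import List, Tuple


def bad_regions(coverage: List[int], min_coverage: int = 0) -> List[Tuple[int, int]]:
    """Return half-open (begin, end) intervals where coverage <= min_coverage."""
    regions = []
    start = None
    for pos, depth in enumerate(coverage):
        if depth <= min_coverage:
            if start is None:
                start = pos  # a bad region opens here
        elif start is not None:
            regions.append((start, pos))  # the bad region closes before pos
            start = None
    if start is not None:
        regions.append((start, len(coverage)))  # bad region runs to the read end
    return regions


def classify(
    read_length: int,
    regions: List[Tuple[int, int]],
    not_covered_ratio: float = 0.8,  # assumed threshold, chosen for illustration
) -> str:
    """Label a read as NotCovered, Chimeric or NotBad from its bad regions."""
    bad_length = sum(end - begin for begin, end in regions)
    # Assumption: the NotCovered test takes precedence over the Chimeric test.
    if read_length > 0 and bad_length / read_length > not_covered_ratio:
        return "NotCovered"
    # A bad region strictly inside the read suggests a chimeric junction.
    if any(begin > 0 and end < read_length for begin, end in regions):
        return "Chimeric"
    return "NotBad"
```

For instance, `classify(len(cov), bad_regions(cov))` labels a read from its per-base coverage list `cov`.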
Long read error-correction tools usually detect and also remove chimeras. But it is difficult to isolate or retrieve information from just this step.
DAStrim (from the DASCRUBBER suite) does a similar job to yacrd but relies on a different mapping step and uses different (likely more advanced) heuristics. yacrd is simpler and easier to use.
This repository contains a set of scripts to evaluate yacrd against other similar tools such as DASCRUBBER and miniscrub on real data sets.
Input can be any set of long reads (PacBio, Nanopore, anything that can be given to minimap2). yacrd takes as input the PAF (Pairwise mApping Format) file produced by minimap2, or a BLASR m4 file from another long-read overlapper.
yacrd is available in the bioconda channel.
If the bioconda channel is set up, you can run:
conda install yacrd
Alternatively, build and install from source with cargo:
git clone https://github.com/natir/yacrd.git
cd yacrd
git checkout v0.6.1
cargo build
cargo test
cargo install --path .
minimap2 reads.fq reads.fq > overlap.paf
yacrd -i overlap.paf -o reads.yacrd
yacrd can perform several post-detection operations (filter, extract, split, scrubb):
minimap2 reads.fq reads.fq > mapping.paf
yacrd -i mapping.paf -o reads.yacrd filter -i reads.fasta -o reads.filter.fasta
yacrd -i mapping.paf -o reads.yacrd extract -i reads.fasta -o reads.extract.fasta
yacrd -i mapping.paf -o reads.yacrd split -i reads.fasta -o reads.split.fasta
yacrd -i mapping.paf -o reads.yacrd scrubb -i reads.fasta -o reads.scrubb.fasta
For Nanopore data, we recommend using minimap2 with the all-vs-all Nanopore preset and a maximal distance between seeds of 500 (option -g 500) to generate overlaps. We recommend running yacrd with a minimal coverage of 4 (option -c 4) and a minimal coverage of read of 0.4 (option -n 0.4).
This is an example of how to run yacrd scrubbing:
minimap2 -x ava-ont -g 500 reads.fasta reads.fasta > overlap.paf
yacrd -i overlap.paf -o report.yacrd -c 4 -n 0.4 scrubb -i reads.fasta -o reads.scrubb.fasta
For PacBio P6-C4 data, we recommend using minimap2 with the all-vs-all PacBio preset and a maximal distance between seeds of 800 (option -g 800) to generate overlaps. We recommend running yacrd with a minimal coverage of 4 (option -c 4) and a minimal coverage of read of 0.4 (option -n 0.4).
minimap2 -x ava-pb -g 800 reads.fasta reads.fasta > overlap.paf
yacrd -i overlap.paf -o report.yacrd -c 4 -n 0.4 scrubb -i reads.fasta -o reads.scrubb.fasta
For PacBio Sequel data, we recommend using minimap2 with the all-vs-all PacBio preset and a maximal distance between seeds of 5000 (option -g 5000) to generate overlaps. We recommend running yacrd with a minimal coverage of 3 (option -c 3) and a minimal coverage of read of 0.4 (option -n 0.4).
minimap2 -x ava-pb -g 5000 reads.fasta reads.fasta > overlap.paf
yacrd -i overlap.paf -o report.yacrd -c 3 -n 0.4 scrubb -i reads.fasta -o reads.scrubb.fasta
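The three recommended settings differ only in a few values, so they can be kept in a small lookup table. The sketch below rebuilds the command lines shown above from such a table. The platform keys, the `Preset` type, and the `scrubbing_commands` helper are hypothetical names introduced for this example; only the flags and values come from the recommendations above.

```python
import shlex
from typing import Dict, List, NamedTuple


class Preset(NamedTuple):
    minimap2_preset: str   # value passed to minimap2 -x
    seed_distance: int     # value passed to minimap2 -g
    coverage: int          # value passed to yacrd -c
    not_coverage: float    # value passed to yacrd -n


# Keys are hypothetical labels; values mirror the recommendations above.
PRESETS: Dict[str, Preset] = {
    "nanopore": Preset("ava-ont", 500, 4, 0.4),
    "pacbio-p6c4": Preset("ava-pb", 800, 4, 0.4),
    "pacbio-sequel": Preset("ava-pb", 5000, 3, 0.4),
}


def scrubbing_commands(platform: str, reads: str = "reads.fasta") -> List[str]:
    """Return the two shell command lines for a scrubbing run."""
    p = PRESETS[platform]
    minimap2 = ["minimap2", "-x", p.minimap2_preset, "-g", str(p.seed_distance),
                reads, reads]
    yacrd = ["yacrd", "-i", "overlap.paf", "-o", "report.yacrd",
             "-c", str(p.coverage), "-n", str(p.not_coverage),
             "scrubb", "-i", reads, "-o", "reads.scrubb.fasta"]
    # The redirection is appended as text because it is shell syntax, not an argument.
    return [shlex.join(minimap2) + " > overlap.paf", shlex.join(yacrd)]
```

Printing `scrubbing_commands("pacbio-sequel")` line by line gives the same two commands as the Sequel example.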
yacrd uses the file extension to detect the file format. If your filename contains (anywhere):
- .paf: the file is considered a minimap file
- .m4, .mhap: the file is considered a BLASR m4 file (mhap output)
- .fa, .fasta: the file is considered a FASTA file
- .fq, .fastq: the file is considered a FASTQ file
- .yacrd: the file is considered a yacrd output file
yacrd automatically detects whether a file is compressed (gzip, bzip2 and lzma compression are available). For post-detection operations, if the input is compressed, the output has the same compression format.
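Because the extension only needs to appear somewhere in the filename, a name such as overlap.paf.gz is still recognised as a minimap file. A minimal sketch of this kind of substring matching follows. It is not yacrd's code: the `guess_format` name, the returned labels, and the order in which extensions are tried are assumptions for illustration.

```python
from typing import Optional

# (extensions, label) pairs, tried in order; the order is an assumption.
_FORMATS = [
    ((".paf",), "paf"),
    ((".m4", ".mhap"), "m4"),
    ((".fasta", ".fa"), "fasta"),
    ((".fastq", ".fq"), "fastq"),
    ((".yacrd",), "yacrd"),
]


def guess_format(filename: str) -> Optional[str]:
    """Guess a file format from an extension found anywhere in the filename."""
    for extensions, label in _FORMATS:
        if any(ext in filename for ext in extensions):
            return label
    return None  # no known extension found
```

Note that a name containing two known extensions is ambiguous, and this sketch simply returns the first match in table order.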
You can use a yacrd report as input in place of an overlap file. The ondisk option is ignored if you use a yacrd report as input.
type_of_read id_in_mapping_file length_of_read length_of_gap,begin_pos_of_gap,end_pos_of_gap;length_of_gap,be…
NotCovered readA 4599 3782,0,3782
Here, readA doesn't have sufficient coverage: there is a zero-coverage region of length 3782bp between positions 0 and 3782.
Chimeric readB 10452 862,1260,2122;3209,4319,7528
Here, readB is chimeric with 2 zero-coverage regions: one between bases 1260 and 2122, another between 4319 and 7528.
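Report lines like the two above are easy to parse for downstream filtering. Here is a minimal sketch, assuming the fields are separated by whitespace and that each gap is written as length,begin,end with gaps joined by semicolons, as in the examples. The `BadRegion`, `ReportRecord`, and `parse_report_line` names are introduced for this example and are not part of yacrd.

```python
from typing import List, NamedTuple


class BadRegion(NamedTuple):
    length: int
    begin: int
    end: int


class ReportRecord(NamedTuple):
    read_type: str
    read_id: str
    read_length: int
    regions: List[BadRegion]


def parse_report_line(line: str) -> ReportRecord:
    """Parse one report line into its type, read id, length and gap list."""
    read_type, read_id, read_length, *rest = line.split()
    regions = []
    if rest:  # the gap field may be absent for reads without bad regions
        for chunk in rest[0].split(";"):
            if chunk:
                length, begin, end = (int(value) for value in chunk.split(","))
                regions.append(BadRegion(length, begin, end))
    return ReportRecord(read_type, read_id, int(read_length), regions)
```

For example, `parse_report_line("Chimeric readB 10452 862,1260,2122;3209,4319,7528")` yields a record with two regions, each satisfying `end - begin == length`.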
If you use yacrd in your research, please cite the following publication:
Pierre Marijon, Rayan Chikhi, Jean-Stéphane Varré, yacrd and fpa: upstream tools for long-read genome assembly, Bioinformatics, btaa262, https://doi.org/10.1093/bioinformatics/btaa262
BibTeX format:
@article{Marijon_2020,
doi = {10.1093/bioinformatics/btaa262},
url = {https://doi.org/10.1093%2Fbioinformatics%2Fbtaa262},
year = 2020,
month = {apr},
publisher = {Oxford University Press ({OUP})},
author = {Pierre Marijon and Rayan Chikhi and Jean-St{\'{e}}phane Varr{\'{e}}},
editor = {Inanc Birol},
title = {yacrd and fpa: upstream tools for long-read genome assembly},
journal = {Bioinformatics}
}