|                 |                                              |
|-----------------|----------------------------------------------|
| Crates.io       | rumi                                         |
| lib.rs          | rumi                                         |
| version         | 0.2.1                                        |
| source          | src                                          |
| created_at      | 2019-11-25 18:26:05.071144                   |
| updated_at      | 2019-11-25 18:35:03.846114                   |
| description     | PCR Deduplication via directional adjacency  |
| homepage        |                                              |
| repository      | https://github.com/sstadick/rumi             |
| max_upload_size |                                              |
| id              | 184263                                       |
| size            | 115,912                                      |
This is beta software; use at your own risk.
Rust UMI-based PCR deduplication, using the same directional-adjacency approach as UMI-tools but with a constant-time Hamming distance implementation.
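As a rough illustration of what a constant-time Hamming distance can look like (a sketch, not necessarily rumi's actual implementation): a UMI of up to 32 bases can be packed into a single `u64` at 2 bits per base, after which the distance between two UMIs is a fixed handful of word operations (XOR, a mask, and a popcount), regardless of where the mismatches fall.

```rust
// Sketch only: 2-bit packing plus XOR/popcount for short (<= 32 base) UMIs.

fn encode_umi(umi: &[u8]) -> u64 {
    let mut encoded = 0u64;
    for &base in umi {
        let bits = match base {
            b'A' => 0b00,
            b'C' => 0b01,
            b'G' => 0b10,
            b'T' => 0b11,
            _ => 0b00, // N/ambiguity codes lumped with A here; real code would handle them explicitly
        };
        encoded = (encoded << 2) | bits;
    }
    encoded
}

fn hamming(a: u64, b: u64) -> u32 {
    let diff = a ^ b;
    // A base differs if either of its two bits differs: collapse each 2-bit
    // slot onto its low bit, then count the set bits.
    let mismatches = (diff | (diff >> 1)) & 0x5555_5555_5555_5555;
    mismatches.count_ones()
}

fn main() {
    let a = encode_umi(b"ACGTACGT");
    let b = encode_umi(b"ACGAACGT");
    assert_eq!(hamming(a, b), 1);
    println!("distance = {}", hamming(a, b));
}
```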
For now this relies on the Rust toolchain; there are excellent docs on how to set that up.
cargo install rumi
    $ rumi --help
    rumi-dedup 0.1.0
    Seth Stadick <sstadick@gmail.com>
    Deduplicate reads based on umis

    USAGE:
        rumi [FLAGS] [OPTIONS] <INBAM> --output <OUTBAM> --umi_tag <umi_tag>

    FLAGS:
            --group_only            Don't deduplicate reads, just group them, give them a group id, and print
                                     them. Rules for filtering out unpaired reads, etc, will still be applied.
        -h, --help                   Prints help information
            --ignore_splice_pos      If two reads have the same start pos, and contain a splice site, they will
                                     be grouped together, instead of further splitting them based on the splice
                                     site
            --is_paired              Input is paired end. Read pairs with unmapped read1 will be ignored.
            --umi_in_read_id         The UMI is located in the read id after the last '_'. Otherwise use the RX
                                     tag.
        -V, --version                Prints version information

    OPTIONS:
        -o, --output <OUTBAM>    Output bam file. Use - if stdout [default: -]
        -c, --allowed_count_factor <allowed_count_factor>
                The factor to multiply the count of a umi by when determining whether or not to group it with
                other umis within allowed_read_dist. Include umi_b as adjacent to umi_a if:
                umi_a.counts >= allowed_count_factor * umi_b.counts [default: 2]
        -n, --allowed_network_depth <allowed_network_depth>
                The number of nodes deep to go when creating a group. If allowed_read_dist is 1, then an
                allowed_network_depth of 2 will enable getting all umis with hamming distance of 2 from the
                current umi. [default: 2]
        -d, --allowed_read_dist <allowed_read_dist>
                The distance between umis that will allow them to be counted as adjacent. [default: 1]
        -u, --umi_tag <umi_tag>    The tag holding the umi information. [default: RX]

    ARGS:
        <INBAM>    Input bam file. Use - if stdin [default: -]
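To make `--allowed_read_dist` and `--allowed_count_factor` concrete, here is a minimal sketch of the directional adjacency test those options describe (the function names and the `HashMap` of counts are illustrative, not rumi's internals): umi_b is pulled into umi_a's group only if the two UMIs are within `allowed_read_dist` of each other and `umi_a.counts >= allowed_count_factor * umi_b.counts`.

```rust
// Illustrative only: the directional adjacency test behind the options above.
use std::collections::HashMap;

fn hamming(a: &[u8], b: &[u8]) -> usize {
    a.iter().zip(b).filter(|(x, y)| x != y).count()
}

/// Should `b` be pulled into the group seeded by `a`?
fn is_adjacent(
    a: &[u8],
    b: &[u8],
    counts: &HashMap<Vec<u8>, u64>,
    allowed_read_dist: usize,
    allowed_count_factor: u64,
) -> bool {
    hamming(a, b) <= allowed_read_dist
        && counts[a] >= allowed_count_factor * counts[b]
}

fn main() {
    let mut counts = HashMap::new();
    counts.insert(b"ACGT".to_vec(), 100u64);
    counts.insert(b"ACGA".to_vec(), 3u64);
    // ACGA is one mismatch away from ACGT and much rarer, so it is grouped
    // under ACGT (likely a PCR/sequencing error of that UMI).
    assert!(is_adjacent(b"ACGT", b"ACGA", &counts, 1, 2));
    // The reverse direction fails the count test: 3 >= 2 * 100 is false.
    assert!(!is_adjacent(b"ACGA", b"ACGT", &counts, 1, 2));
}
```

The count condition is what makes the adjacency directional: an abundant UMI can absorb rare near-neighbours, which are likely errors of it, but a rare UMI can never absorb an abundant one.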
I have not sat down and done any serious benchmarking yet. Anecdotally, this is at least 4x faster than umi_tools on small datasets, and there is still a lot of low-hanging fruit in terms of optimizations to apply.
I would fully expect this implementation to be capable of at least a 10x performance boost once it has been smoothed out. The large advantage it has over umi_tools is that it can take advantage of multiple cores; umi_tools has already shifted a large amount of its work into C code, so just being a compiled language isn't a huge advantage on its own.
First pass: collect all reads into a dict keyed on position, tracking metrics like UMI frequency and the extracted UMIs while building it. Then iterate over that dict and deduplicate at each position.
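A minimal sketch of that two-pass structure, using a hypothetical `Read` struct in place of real BAM records (BAM I/O and the adjacency grouping itself are elided):

```rust
// Sketch of the two-pass structure described above; `Read` and its fields are
// hypothetical stand-ins for aligned BAM records.
use std::collections::HashMap;

#[derive(Debug, Clone)]
struct Read {
    tid: i32,     // reference/contig id
    pos: i64,     // alignment start position
    umi: Vec<u8>, // extracted UMI (from the RX tag or the read id)
}

fn main() {
    let reads = vec![
        Read { tid: 0, pos: 100, umi: b"ACGT".to_vec() },
        Read { tid: 0, pos: 100, umi: b"ACGA".to_vec() },
        Read { tid: 0, pos: 250, umi: b"TTAG".to_vec() },
    ];

    // First pass: bucket reads by position and count UMI frequencies as we go.
    let mut by_pos: HashMap<(i32, i64), Vec<Read>> = HashMap::new();
    let mut umi_counts: HashMap<Vec<u8>, u64> = HashMap::new();
    for read in reads {
        *umi_counts.entry(read.umi.clone()).or_insert(0) += 1;
        by_pos.entry((read.tid, read.pos)).or_default().push(read);
    }
    println!("{} distinct UMIs observed", umi_counts.len());

    // Second pass: deduplicate within each position independently. Because the
    // positions are independent, this is the loop that can be farmed out
    // across cores (e.g. with rayon) for the multi-core advantage noted above.
    for ((tid, pos), reads_here) in &by_pos {
        println!("tid {} pos {}: {} reads before dedup", tid, pos, reads_here.len());
        // ... build the directional-adjacency UMI network here (using
        // umi_counts) and keep one representative read per group ...
    }
}
```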