Crates.io | nucleic-acid |
lib.rs | nucleic-acid |
version | 0.1.1 |
source | src |
created_at | 2017-04-23 05:55:44.049775 |
updated_at | 2017-05-31 06:15:36.501413 |
description | Implementations of Suffix Array, Burrows-Wheeler Transform and FM-index |
homepage | |
repository | https://github.com/Wafflespeanut/nucleic-acid |
max_upload_size | |
id | 11627 |
size | 46,417 |
This Rust library has some of the bioinformatics stuff I'd written for playing with DNA sequences. It has the following implementations:
BitsVec
provides a generic interface for stuff that have a known bit range.Add this to your Cargo.toml
nucleic-acid = "0.1"
See the documentation for exact usage and detailed examples.
The implementations for BWT and FM-index have already been provided by the awesome rust-bio
library. But, that's not great for large datasets (~4 GB). This library was written to handle such datasets efficiently.
BitsVec
Note that BitsVec
is a lot slower compared to Vec
, because, we can't move pointers by bits, and so we gotta do some shifting and bitwise stuff to achieve this. That's at least 7-10 additional operations (per function call) in addition to the pointer read/write. So, it's slow.
bench_1_bits_vec_fill_with_1000_elements ... bench: 1,961 ns/iter (+/- 142)
bench_1_bits_vec_get_1000_ints ... bench: 26,429 ns/iter (+/- 281)
bench_1_bits_vec_push_1000_ints ... bench: 8,574 ns/iter (+/- 1,409)
bench_1_bits_vec_set_1000_ints ... bench: 31,423 ns/iter (+/- 948)
bench_22_bits_vec_fill_with_1000_elements ... bench: 1,422 ns/iter (+/- 184)
bench_22_bits_vec_get_1000_ints ... bench: 28,098 ns/iter (+/- 458)
bench_22_bits_vec_push_1000_ints ... bench: 11,701 ns/iter (+/- 3,853)
bench_22_bits_vec_set_1000_ints ... bench: 32,632 ns/iter (+/- 1,032)
bench_40_bits_vec_fill_with_1000_elements ... bench: 1,941 ns/iter (+/- 123)
bench_40_bits_vec_get_1000_ints ... bench: 27,771 ns/iter (+/- 2,613)
bench_40_bits_vec_push_1000_ints ... bench: 13,475 ns/iter (+/- 5,716)
bench_40_bits_vec_set_1000_ints ... bench: 32,786 ns/iter (+/- 1,649)
bench_63_bits_vec_fill_with_1000_elements ... bench: 3,078 ns/iter (+/- 273)
bench_63_bits_vec_get_1000_ints ... bench: 29,247 ns/iter (+/- 2,903)
bench_63_bits_vec_push_1000_ints ... bench: 20,756 ns/iter (+/- 2,717)
bench_63_bits_vec_set_1000_ints ... bench: 34,674 ns/iter (+/- 2,819)
As you may notice, this becomes inefficient once you approach the size of usize
(in my case, pushing 63 bit values is a lot slower than pushing 22 or 40 bit values).
suffix_array
Since the induced sorting method is O(n), it's a lot faster than the usual O(nlogn) sorting, and it's also memory efficient.
bench_sort_rotations_1000_random_values ... bench: 292,912 ns/iter (+/- 24,688)
bench_suffix_array_1000_random_values ... bench: 100,227 ns/iter (+/- 16,021)
FMIndex
FM-index is very fast in its construction and getting, but it consumes a lot of memory (almost the same as the suffix array). There are multiple ways to optimize this (I'll try to do it in the future).
bench_fm_index_1000_random_values_constructor ... bench: 115,514 ns/iter (+/- 20,053)
bench_fm_index_1000_random_values_get_100_chars ... bench: 1,094 ns/iter (+/- 78)