prot_translate

Crates.io	prot_translate
lib.rs	prot_translate
version	0.1.0
source	src
created_at	2024-04-15 17:08:43.581295
updated_at	2024-04-15 17:08:43.581295
description	Translate nucleotide sequence to protein.
homepage	https://github.com/DorianCoding/prot_translate
repository	https://github.com/DorianCoding/prot_translate
id	1209459
size	45,592

(DorianCoding)

documentation

https://docs.rs/prot_translate/

README

prot_translate

Translate a nucleotide sequence (DNA or RNA) to protein.

Usage

Add this to your Cargo.toml:

[dependencies]
prot_translate = "0.1.0"

Example

use prot_translate::*;

fn main() {
    // One-letter codes; `*` marks a stop codon.
    let dna = b"GTGAGTCGTTGAGTCTGATTGCGTATC";
    let protein = translate(dna);
    assert_eq!("VSR*V*LRI", &protein);

    // To shift the reading frame, skip leading bases of the same sequence.
    let protein_frame2 = translate(&dna[1..]);
    assert_eq!("*VVESDCV", &protein_frame2);

    // Three-letter codes.
    let dna = b"GCTAGTCGTATCGTAGCTAGTC";
    let peptide = translate3(dna, None);
    assert_eq!(&peptide, "AlaSerArgIleValAlaSer");

    // Full amino acid names.
    let peptide = translate_full(dna, None);
    assert_eq!(&peptide, "AlanineSerineArginineIsoleucineValineAlanineSerine");
}
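The reading-frame shift in the example above can be wrapped in a small helper. The following is a minimal sketch, not part of prot_translate: the function name `frame` is hypothetical, and trimming the slice to whole codons is an assumption of this sketch rather than something the crate requires.

```rust
/// Hypothetical helper (not part of prot_translate): returns the forward
/// reading frame that starts at `offset`, trimmed to whole codons.
/// An offset past the end of the sequence yields an empty slice.
fn frame(seq: &[u8], offset: usize) -> &[u8] {
    let rest = seq.get(offset..).unwrap_or(&[]);
    // Drop a trailing partial codon, if any.
    &rest[..rest.len() - rest.len() % 3]
}
```

Each returned slice could then be passed to `translate`, for example `translate(frame(dna, 1))` for the second frame.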

Benchmarks

The current algorithm is inspired by SeqAn's implementation, which uses array indexing. Here is how it performs against other methods (tested on a 2012 MacBook Pro).

Method	10 bp*	100 bp	1,000 bp	10,000 bp	100,000 bp	1 million bp
prot_translate	91 ns	0.29 μs	2.28 μs	23 μs	215 μs	2.25 ms
fnv hashmap	111 ns	0.37 μs	3.58 μs	37 μs	366 μs	3.86 ms
std hashmap	160 ns	1.03 μs	9.65 μs	100 μs	943 μs	9.40 ms
phf_map	177 ns	1.04 μs	9.47 μs	100 μs	936 μs	9.91 ms
match statement	259 ns	1.77 μs	17.9 μs	163 μs	1941 μs	19.1 ms
prot_translate (unchecked)	90 ns	0.26 μs	2.02 μs	20 μs	197 μs	1.92 ms

*bp = "base pairs"

To run the benchmarks yourself (nightly is required because of the phf_map macro):

cargo +nightly bench

Thoughts

FNV seems to be a great option, but I have chosen the current implementation because it is slightly faster and does not require any dependencies.
There was originally a function called translate_unchecked that did not check that each byte was valid ASCII, but since the performance gain was negligible, it was removed.
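To illustrate what an array-indexing approach can look like, here is a minimal sketch. It is not the crate's actual code: the names `BASE_INDEX`, `AMINO_ACIDS`, `codon_to_amino_acid`, and `translate_sketch` are hypothetical, and returning `X` for codons with unrecognized bases is an assumption of this sketch. Each base is mapped to a 2-bit index through a 256-entry table, and the three indices are combined into one offset into a 64-entry table of the standard genetic code.

```rust
/// Sentinel for bytes that are not A, C, G, T or U (uppercase only here).
const INVALID: u8 = 0xFF;

/// Builds a 256-entry table mapping each byte to a 2-bit base index
/// (T/U = 0, C = 1, A = 2, G = 3).
const fn build_base_index() -> [u8; 256] {
    let mut table = [INVALID; 256];
    table[b'T' as usize] = 0;
    table[b'U' as usize] = 0;
    table[b'C' as usize] = 1;
    table[b'A' as usize] = 2;
    table[b'G' as usize] = 3;
    table
}

const BASE_INDEX: [u8; 256] = build_base_index();

/// Standard genetic code, with bases ordered T, C, A, G at each position.
const AMINO_ACIDS: &[u8; 64] =
    b"FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG";

/// Looks up one codon by array indexing; `None` if any base is unrecognized.
fn codon_to_amino_acid(codon: [u8; 3]) -> Option<char> {
    let mut index = 0usize;
    for base in codon {
        let i = BASE_INDEX[base as usize];
        if i == INVALID {
            return None;
        }
        index = index * 4 + i as usize;
    }
    Some(AMINO_ACIDS[index] as char)
}

/// Translates whole codons and ignores a trailing partial codon.
/// Assumption of this sketch: unrecognized codons become `X`.
fn translate_sketch(seq: &[u8]) -> String {
    seq.chunks_exact(3)
        .map(|c| codon_to_amino_acid([c[0], c[1], c[2]]).unwrap_or('X'))
        .collect()
}
```

Compared with a hash map, this lookup needs no hashing and no dependencies, which fits the trade-off described above.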

Todo

Add other codon tables (e.g. Vertebrate Mitochondrial, Yeast Mitochondrial, Mold Mitochondrial, etc.)
Add support for ambiguous nucleotides (currently only A, U, T, C, and G are supported)
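As a rough idea of what the first item could involve, an alternative codon table can be expressed as a handful of overrides on top of the standard code. The sketch below is hypothetical and not part of prot_translate: the function name `vertebrate_mito_override` is made up for illustration. It covers the vertebrate mitochondrial code (NCBI translation table 2), which differs from the standard code in four codons.

```rust
/// Hypothetical sketch (not part of prot_translate): given a codon and the
/// amino acid the standard code assigns to it, returns the amino acid under
/// the vertebrate mitochondrial code (NCBI translation table 2).
fn vertebrate_mito_override(codon: &[u8; 3], standard: char) -> char {
    match codon {
        // AGA and AGG are stop codons instead of arginine.
        b"AGA" | b"AGG" => '*',
        // ATA (AUA in RNA) codes for methionine instead of isoleucine.
        b"ATA" | b"AUA" => 'M',
        // TGA (UGA in RNA) codes for tryptophan instead of stop.
        b"TGA" | b"UGA" => 'W',
        // Every other codon keeps its standard meaning.
        _ => standard,
    }
}
```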

Tests

To run the tests:

cargo test

You can also generate new test data (requires python3 and biopython).

# Generate 500 random sequences and their peptides
python3 tests/generate_test_data.py 500
