Crates.io | lightmotif |
lib.rs | lightmotif |
version | 0.9.1 |
source | src |
created_at | 2023-05-03 23:56:28.479279 |
updated_at | 2024-09-02 22:46:34.18844 |
description | A lightweight platform-accelerated library for biological motif scanning using position weight matrices. |
homepage | https://github.com/althonos/lightmotif |
repository | https://github.com/althonos/lightmotif |
max_upload_size | |
id | 856086 |
size | 4,876,490 |
lightmotif
A lightweight platform-accelerated library for biological motif scanning using position weight matrices.
Motif scanning with position weight matrices (also known as position-specific scoring matrices) is a robust method for identifying motifs of fixed length inside a biological sequence. They can be used to identify transcription factor binding sites in DNA, or protease cleavage site in polypeptides. Position weight matrices are often viewed as sequence logos:
The lightmotif
library provides a Rust crate to run very efficient
searches for a motif encoded in a position weight matrix. The position
scanning combines several techniques to allow high-throughput processing
of sequences:
permute
instructions of AVX2.Other crates from the ecosystem provide additional features if needed:
lightmotif-io
is a crate with parser implementations for various count matrix, frequency matrix and position-specific scoring matrix formats such as TRANSFAC or JASPAR.lightmotif-tfmpvalue
is an exact reimplementation of the TFM-PVALUE[4] algorithm for converting between a score and a p-value for a given scoring matrix.This is the Rust version, there is a Python package available as well.
use lightmotif::*;
use lightmotif::abc::Nucleotide;
// Create a count matrix from an iterable of motif sequences
let counts = CountMatrix::<Dna>::from_sequences(
["GTTGACCTTATCAAC", "GTTGATCCAGTCAAC"]
.into_iter()
.map(|s| EncodedSequence::encode(s).unwrap()),
)
.unwrap();
// Create a PSSM with 0.1 pseudocounts and uniform background frequencies.
let pssm = counts.to_freq(0.1).to_scoring(None);
// Use the pipeline to encode the target sequence into a striped matrix
let seq = "ATGTCCCAACAACGATACCCCGAGCCCATCGCCGTCATCGGCTCGGCATGCAGATTCCCAGGCG";
let encoded = EncodedSequence::encode(seq).unwrap();
let mut striped = encoded.to_striped();
// Organize layout of striped matrix to allow scoring with PSSM.
striped.configure(&pssm);
// Compute scores for every position of the matrix.
let scores = pssm.score(&striped);
// Scores can be extracted into a Vec<f32>, or indexed directly.
let v = scores.unstripe();
assert_eq!(scores[0], -23.07094);
assert_eq!(v[0], -23.07094);
// Find the highest scoring position.
let best = scores.argmax().unwrap();
assert_eq!(best, 18);
// Find the positions above an absolute score threshold.
let indices = scores.threshold(10.0);
assert_eq!(indices, []);
This example uses a dynamic dispatch pipeline, which selects the best available backend (AVX2, SSE2, NEON, or a generic implementation) depending on the local platform.
Both benchmarks use the MX000001
motif from PRODORIC[5], and the
complete genome of an
Escherichia coli K12 strain.
Benchmarks were run on a i7-10710U CPU running @1.10GHz, compiled with --target-cpu=native
.
Score every position of the genome with the motif weight matrix:
test bench_avx2 ... bench: 4,510,794 ns/iter (+/- 9,570) = 1029 MB/s
test bench_sse2 ... bench: 26,773,537 ns/iter (+/- 57,891) = 173 MB/s
test bench_generic ... bench: 317,731,004 ns/iter (+/- 2,567,370) = 14 MB/s
Find the highest-scoring position for a motif in a 10kb sequence
(compared to the PSSM algorithm implemented in
bio::pattern_matching::pssm
):
test bench_avx2 ... bench: 12,797 ns/iter (+/- 380) = 781 MB/s
test bench_sse2 ... bench: 62,597 ns/iter (+/- 43) = 159 MB/s
test bench_generic ... bench: 671,900 ns/iter (+/- 1,150) = 14 MB/s
test bench_bio ... bench: 1,193,911 ns/iter (+/- 2,519) = 8 MB/s
Found a bug ? Have an enhancement request ? Head over to the GitHub issue tracker if you need to report or ask something. If you are filing in on a bug, please include as much information as you can about the issue, and try to recreate the same bug in a simple, easily reproducible situation.
This project adheres to Semantic Versioning and provides a changelog in the Keep a Changelog format.
This library is provided under the open-source MIT license.
This project was developed by Martin Larralde during his PhD project at the European Molecular Biology Laboratory in the Zeller team.