Crates.io | lightmotif-py |
lib.rs | lightmotif-py |
version | 0.9.1 |
source | src |
created_at | 2023-05-04 00:17:52.175607 |
updated_at | 2024-09-02 22:47:00.348475 |
description | PyO3 bindings and Python interface to the lightmotif crate. |
homepage | https://github.com/althonos/lightmotif/tree/main/lightmotif-py |
repository | https://github.com/althonos/lightmotif |
max_upload_size | |
id | 856100 |
size | 232,790 |
lightmotif
A lightweight platform-accelerated library for biological motif scanning using position weight matrices.
Motif scanning with position weight matrices (also known as position-specific scoring matrices) is a robust method for identifying motifs of fixed length inside a biological sequence. They can be used to identify transcription factor binding sites in DNA, or protease cleavage site in polypeptides. Position weight matrices are often viewed as sequence logos:
The lightmotif
library provides a Python module to run very efficient
searches for a motif encoded in a position weight matrix. The position
scanning combines several techniques to allow high-throughput processing
of sequences:
permute
instructions of AVX2.This is the Python version, there is a Rust crate available as well.
lightmotif
can be installed directly from PyPI,
which hosts some pre-built wheels for most mainstream platforms, as well as the
code required to compile from source with Rust:
$ pip install lightmotif
In the event you have to compile the package from source, all the required Rust libraries are vendored in the source distribution, and a Rust compiler will be setup automatically if there is none on the host machine.
The motif interface should be mostly compatible with the
Bio.motifs
module from Biopython. The notable difference is that
the calculate
method of PSSM objects expects a striped sequence instead.
import lightmotif
# Create a count matrix from an iterable of sequences
motif = lightmotif.create(["GTTGACCTTATCAAC", "GTTGATCCAGTCAAC"])
# Create a PSSM with 0.1 pseudocounts and uniform background frequencies
pwm = motif.counts.normalize(0.1)
pssm = pwm.log_odds()
# Encode the target sequence into a striped matrix
seq = "ATGTCCCAACAACGATACCCCGAGCCCATCGCCGTCATCGGCTCGGCATGCAGATTCCCAGGCG"
striped = lightmotif.stripe(seq)
# Compute scores using the fastest backend implementation for the host machine
scores = pssm.calculate(sseq)
Benchmarks use the MX000001
motif from PRODORIC[4], and the
complete genome of an
Escherichia coli K12 strain.
Benchmarks were run on a i7-10710U CPU running @1.10GHz, compiled with --target-cpu=native
.
lightmotif (avx2): 5,479,884 ns/iter (+/- 3,370,523) = 807.8 MiB/s
Bio.motifs: 334,359,765 ns/iter (+/- 11,045,456) = 13.2 MiB/s
MOODS.scan: 182,710,624 ns/iter (+/- 9,459,257) = 24.2 MiB/s
pymemesuite.fimo: 239,694,118 ns/iter (+/- 7,444,620) = 18.5 MiB/s
Found a bug ? Have an enhancement request ? Head over to the GitHub issue tracker if you need to report or ask something. If you are filing in on a bug, please include as much information as you can about the issue, and try to recreate the same bug in a simple, easily reproducible situation.
This project adheres to Semantic Versioning and provides a changelog in the Keep a Changelog format.
This library is provided under the GNU General Public License 3.0 or later,
as it contains the GPL-licensed code of the TFM-PVALUE algorithm. The TFM-PVALUE dependency can be disabled by disabling
the pvalue
crate feature, in which case the code can be used and redistributed under the terms
of the MIT license.
This project was developed by Martin Larralde during his PhD project at the European Molecular Biology Laboratory in the Zeller team.