{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Example\n", "\n", "This Jupyter notebook shows how to use the library with common examples." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import lightmotif\n", "lightmotif.__version__" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from urllib.request import urlopen" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Creating a motif\n", "\n", "A `Motif` can be created from several sequences of the same length using the\n", "`lightmotif.create` function. This first builds a `CountMatrix` from each \n", "sequence position, and then creates a `WeightMatrix` and a `ScoringMatrix`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "motif = lightmotif.create([\"AATTGTGGTTA\", \"ATCTGTGGTTA\", \"TTCTGCGGTTA\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Loading a motif\n", "\n", "The `lightmotif.load` function can be used to load the motifs found in a given\n", "file. Because it supports any file-like object, we can immediately download a\n", "motif from the [JASPAR](https://jaspar.elixir.no/) database and parse it on \n", "the fly:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "url = \"https://jaspar.elixir.no/api/v1/matrix/MA0002.1.jaspar\"\n", "with urlopen(url) as response:\n", " motif = next(lightmotif.load(response, format=\"jaspar16\"))\n", " print(f\"Loaded motif {motif.name} of length {len(motif.counts)}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Adding pseudo-counts\n", "\n", "By default, the loaded scoring matrix is built with zero pseudo-counts and \n", "a uniform background, which may not be ideal. Using the `CountMatrix.normalize`\n", "and `WeightMatrix.log_odds` methods, we can build a new `ScoringMatrix` with\n", "pseudo-counts of 0.1:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pssm = motif.counts.normalize(0.1).log_odds()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Preparing a sequence\n", "\n", "Since the motif we loaded is a human transcription factor binding site, \n", "it makes sense to use a human sequence. As an example, we can load a \n", "contig from the human chromosome 22 ([NT_167212.2](https://www.ncbi.nlm.nih.gov/nuccore/NT_167212.2))." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "url = \"https://www.ncbi.nlm.nih.gov/sviewer/viewer.cgi?tool=portal&save=file&report=fasta&id=568801992\"\n", "with urlopen(url) as response:\n", " response.readline()\n", " sequence = ''.join(line.strip().decode() for line in response)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To score a sequence with `lightmotif`, if must be first encoded and stored with\n", "a particular memory layout. This is taken care of by the `lightmotif.stripe`\n", "function. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "striped = lightmotif.stripe(sequence)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Calculate scores\n", "\n", "Once the sequence has been prepared, it can be used with the different functions\n", "and methods of `lightmotif` to compute scores for each position. The most most\n", "basic functionality is to compute the PSSM scores for every position of the \n", "sequence. This can be done with the `ScoringMatrix.calculate` method:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "scores = pssm.calculate(striped)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The scores are computed in an efficient column-major matrix which can be used\n", "to further extract high scoring positions:\n", "\n", "- The `argmax` method returns the smallest index with the highest score\n", "- The `max` method returns the highest score\n", "- The `threshold` method returns a list of positions with a score above the given score" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(f\"Highest score: {scores.max():.3f}\")\n", "print(f\"Position with highest score: {scores.argmax()}\")\n", "print(f\"Position with score above 14: {scores.threshold(13.0)}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Otherwise, the resulting array can be accessed by index, and flattened into\n", "a list (or an `array`):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(\"Score at position 90517:\", scores[156007])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Using p-value thresholds\n", "\n", "LightMotif features a re-implementation of the TFP-PVALUE algorithm which \n", "can convert between a bitscore and a p-value for a given scoring matrix. Use\n", "the `ScoringMatrix.score` method to compute the score threshold for a *p-value*:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(f\"Score threshold for p=1e-5: {pssm.score(1e-5):.3f}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `ScoringMatrix.pvalue` method can compute the *p-value* for a score, allowing\n", "to compute them for scores obtained by the scoring pipeline:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for index in scores.threshold(13.0):\n", " print(f\"Hit at position {index:6}: score={scores[index]:.3f} p={pssm.pvalue(scores[index]):.3g}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Scanning algorithm\n", "\n", "For cases where a long sequence is being processed, and only a handful of \n", "significative hits is expected, using a scanner will be much more efficient.\n", "A `Scanner` can be created with the `lightmotif.scan` function, and yields\n", "`Hit` objects for every position above the threshold parameter:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "scanner = lightmotif.scan(pssm, striped, threshold=13.0)\n", "for hit in scanner:\n", " print(f\"Hit at position {hit.position:6}: score={hit.score:.3f}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Although it gives equivalent results to the `calculate` example above, the \n", "`scan` implementation uses less memory and is generally faster for higher\n", "threshold values." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Reverse-complement\n", "\n", "All the examples above are showing how to calculate the hits for the direct \n", "strand. To process the reverse-strand, one could reverse-complement the sequence;\n", "however, it is much more efficient to reverse-complement the `ScoringMatrix`, \n", "as it is usually much smaller in memory." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "psmm_rc = pssm.reverse_complement()\n", "scanner_rc = lightmotif.scan(psmm_rc, striped, threshold=13.0)" ] } ], "metadata": { "language_info": { "name": "python" } }, "nbformat": 4, "nbformat_minor": 2 }