{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Example\n",
    "\n",
    "This Jupyter notebook shows how to use the library with common examples."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import lightmotif\n",
    "lightmotif.__version__"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from urllib.request import urlopen"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Creating a motif\n",
    "\n",
    "A `Motif` can be created from several sequences of the same length using the\n",
    "`lightmotif.create` function. This first builds a `CountMatrix` from each \n",
    "sequence position, and then creates a `WeightMatrix` and a `ScoringMatrix`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "motif = lightmotif.create([\"AATTGTGGTTA\", \"ATCTGTGGTTA\", \"TTCTGCGGTTA\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Loading a motif\n",
    "\n",
    "The `lightmotif.load` function can be used to load the motifs found in a given\n",
    "file. Because it supports any file-like object, we can immediately download a\n",
    "motif from the [JASPAR](https://jaspar.elixir.no/) database and parse it on \n",
    "the fly:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "url = \"https://jaspar.elixir.no/api/v1/matrix/MA0002.1.jaspar\"\n",
    "with urlopen(url) as response:\n",
    "    motif = next(lightmotif.load(response, format=\"jaspar16\"))\n",
    "    print(f\"Loaded motif {motif.name} of length {len(motif.counts)}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Adding pseudo-counts\n",
    "\n",
    "By default, the loaded scoring matrix is built with zero pseudo-counts and \n",
    "a uniform background, which may not be ideal. Using the `CountMatrix.normalize`\n",
    "and `WeightMatrix.log_odds` methods, we can build a new `ScoringMatrix` with\n",
    "pseudo-counts of 0.1:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pssm = motif.counts.normalize(0.1).log_odds()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Preparing a sequence\n",
    "\n",
    "Since the motif we loaded is a human transcription factor binding site, \n",
    "it makes sense to use a human sequence. As an example, we can load a \n",
    "contig from the human chromosome 22 ([NT_167212.2](https://www.ncbi.nlm.nih.gov/nuccore/NT_167212.2))."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "url = \"https://www.ncbi.nlm.nih.gov/sviewer/viewer.cgi?tool=portal&save=file&report=fasta&id=568801992\"\n",
    "with urlopen(url) as response:\n",
    "    response.readline()\n",
    "    sequence = ''.join(line.strip().decode() for line in response)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To score a sequence with `lightmotif`, if must be first encoded and stored with\n",
    "a particular memory layout. This is taken care of by the `lightmotif.stripe`\n",
    "function. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "striped = lightmotif.stripe(sequence)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Calculate scores\n",
    "\n",
    "Once the sequence has been prepared, it can be used with the different functions\n",
    "and methods of `lightmotif` to compute scores for each position. The most most\n",
    "basic functionality is to compute the PSSM scores for every position of the \n",
    "sequence. This can be done with the `ScoringMatrix.calculate` method:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "scores = pssm.calculate(striped)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The scores are computed in an efficient column-major matrix which can be used\n",
    "to further extract high scoring positions:\n",
    "\n",
    "- The `argmax` method returns the smallest index with the highest score\n",
    "- The `max` method returns the highest score\n",
    "- The `threshold` method returns a list of positions with a score above the given score"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(f\"Highest score: {scores.max():.3f}\")\n",
    "print(f\"Position with highest score: {scores.argmax()}\")\n",
    "print(f\"Position with score above 14: {scores.threshold(13.0)}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Otherwise, the resulting array can be accessed by index, and flattened into\n",
    "a list (or an `array`):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"Score at position 90517:\", scores[156007])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Using p-value thresholds\n",
    "\n",
    "LightMotif features a re-implementation of the TFP-PVALUE algorithm which \n",
    "can convert between a bitscore and a p-value for a given scoring matrix. Use\n",
    "the `ScoringMatrix.score` method to compute the score threshold for a *p-value*:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(f\"Score threshold for p=1e-5: {pssm.score(1e-5):.3f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `ScoringMatrix.pvalue` method can compute the *p-value* for a score, allowing\n",
    "to compute them for scores obtained by the scoring pipeline:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "for index in scores.threshold(13.0):\n",
    "    print(f\"Hit at position {index:6}: score={scores[index]:.3f} p={pssm.pvalue(scores[index]):.3g}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Scanning algorithm\n",
    "\n",
    "For cases where a long sequence is being processed, and only a handful of \n",
    "significative hits is expected, using a scanner will be much more efficient.\n",
    "A `Scanner` can be created with the `lightmotif.scan` function, and yields\n",
    "`Hit` objects for every position above the threshold parameter:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "scanner = lightmotif.scan(pssm, striped, threshold=13.0)\n",
    "for hit in scanner:\n",
    "    print(f\"Hit at position {hit.position:6}: score={hit.score:.3f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Although it gives equivalent results to the `calculate` example above, the \n",
    "`scan` implementation uses less memory and is generally faster for higher\n",
    "threshold values."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Reverse-complement\n",
    "\n",
    "All the examples above are showing how to calculate the hits for the direct \n",
    "strand. To process the reverse-strand, one could reverse-complement the sequence;\n",
    "however, it is much more efficient to reverse-complement the `ScoringMatrix`, \n",
    "as it is usually much smaller in memory."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "psmm_rc = pssm.reverse_complement()\n",
    "scanner_rc = lightmotif.scan(psmm_rc, striped, threshold=13.0)"
   ]
  }
 ],
 "metadata": {
  "language_info": {
   "name": "python"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}