# splici

A Rust implementation of the splici algorithm for building spliced/unspliced transcript sequences.

## Overview

This implementation is written entirely in Rust and builds on three bioinformatics
libraries:

1. [`gtftools`](https://github.com/noamteyssier/gtftools/) - For parsing GTF files
2. [`bedrs`](https://github.com/noamteyssier/bedrs) - For genomic interval arithmetic
3. [`faiquery`](https://github.com/noamteyssier/faiquery/) - For fast querying of indexed FASTA files

## Usage

```bash
splici introns \
    -f <your.fasta> \
    -g <your.gtf> \
    -o splici.fasta.gz;
```

This generates a splici reference FASTA from the transcripts and exons found
in the GTF, querying their sequences from the indexed FASTA provided.

The FASTA must be indexed beforehand with [`samtools faidx`](http://www.htslib.org/doc/samtools-faidx.html).

### Getting Started

You can download the latest Ensembl DNA and GTF files using [`ggetrs ensembl ref`](https://noamteyssier.github.io/ggetrs/ensembl/ref.html):

```bash
ggetrs ensembl ref -D -d dna,gtf
```

Unzip and index the reference DNA.

```bash
gunzip Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz 
samtools faidx Homo_sapiens.GRCh38.dna.primary_assembly.fa
```

Then run `splici` to generate your splici reference FASTA:

```bash
splici introns \
    -f Homo_sapiens.GRCh38.dna.primary_assembly.fa \
    -g Homo_sapiens.GRCh38.*.gtf.gz \
    -o splici.fasta.gz;
```

## Background

The splici algorithm was described by He et al. (2022); the name is shorthand for
*spliced* + *intronic* sequences.

It describes a method to isolate the intronic regions of all incoming transcripts
and generate the sequences of both the spliced transcripts as well as their intronic
components.

The algorithm is applied on each gene individually.

First, all transcripts for a gene are identified.
Then the intronic regions of each transcript are identified.
An intronic region is what remains of a transcript's span after
subtracting its exonic intervals (see [internal](https://docs.rs/bedrs/latest/bedrs/traits/container/trait.Internal.html#method.internal)).
Next, each intronic region is extended by a configurable amount
on both ends, which allows reads to align across junctions between intronic
and exonic regions.
Intronic regions from different isoforms of a gene usually overlap heavily, so a
[merging](https://docs.rs/bedrs/latest/bedrs/traits/container/trait.Merge.html#method.merge)
step is performed on the intronic regions to avoid redundant
intervals in the final sequences.
Finally, each merged intronic region is given a unique name and added to the
splici reference.
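
The intron steps above can be sketched in plain Rust using only the standard library. This is an illustration of the idea, not the code used by `splici` (which relies on `bedrs` for these operations). The `Interval` type and the `introns`, `extend`, and `merge` functions are hypothetical names, and the sketch assumes half-open `[start, end)` coordinates on a single chromosome and ignores clamping at the chromosome end.

```rust
/// Half-open genomic interval [start, end) on one chromosome (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct Interval {
    start: u64,
    end: u64,
}

/// Intronic regions of one transcript: the gaps left in the transcript's
/// span once its exons are subtracted.
fn introns(exons: &[Interval]) -> Vec<Interval> {
    let mut sorted = exons.to_vec();
    sorted.sort();
    let mut gaps = Vec::new();
    // Rightmost position covered by the exons seen so far.
    let mut covered_to = match sorted.first() {
        Some(first) => first.end,
        None => return gaps,
    };
    for exon in &sorted[1..] {
        if exon.start > covered_to {
            gaps.push(Interval { start: covered_to, end: exon.start });
        }
        covered_to = covered_to.max(exon.end);
    }
    gaps
}

/// Extend every interval by `flank` bases on both ends, clamping at zero.
/// Assumption: the upper bound (chromosome length) is not checked here.
fn extend(intervals: &[Interval], flank: u64) -> Vec<Interval> {
    intervals
        .iter()
        .map(|iv| Interval {
            start: iv.start.saturating_sub(flank),
            end: iv.end + flank,
        })
        .collect()
}

/// Merge overlapping or touching intervals into a minimal sorted set.
fn merge(mut intervals: Vec<Interval>) -> Vec<Interval> {
    intervals.sort();
    let mut merged: Vec<Interval> = Vec::new();
    for iv in intervals {
        if let Some(last) = merged.last_mut() {
            if iv.start <= last.end {
                last.end = last.end.max(iv.end);
                continue;
            }
        }
        merged.push(iv);
    }
    merged
}
```

For a gene, one would collect `extend(&introns(exons), flank)` over every transcript into a single vector and pass it to `merge`, so that introns shared between isoforms appear only once.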

The spliced transcripts are generated by concatenating the exonic intervals
for each transcript.
These are named by the transcript id and added to the splici reference.
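
A minimal sketch of the concatenation step follows, again in plain Rust and not taken from the `splici` source. The function names `spliced_sequence` and `fasta_record` are hypothetical. The sketch assumes the chromosome sequence is already in memory as ASCII text, that exons are half-open `[start, end)` offsets into it, and that the transcript lies on the forward strand (strand handling is left out).

```rust
/// Concatenate the exonic sequences of one transcript in genomic order.
/// `exons` holds half-open [start, end) offsets into `chrom`.
/// Assumption: forward strand only; no reverse complement is applied.
fn spliced_sequence(chrom: &str, exons: &[(usize, usize)]) -> String {
    let mut sorted = exons.to_vec();
    sorted.sort();
    sorted
        .iter()
        .map(|&(start, end)| &chrom[start..end])
        .collect()
}

/// Format a sequence as a FASTA record named by the transcript id.
fn fasta_record(transcript_id: &str, seq: &str) -> String {
    format!(">{}\n{}\n", transcript_id, seq)
}
```

Calling `fasta_record` with a transcript id and the output of `spliced_sequence` yields one entry of the spliced half of the reference.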

## References

1. He, D. et al. Alevin-fry unlocks rapid, accurate and memory-frugal quantification of single-cell RNA-seq data. Nat Methods 19, 316–322 (2022).