# splici

A Rust implementation of the splici algorithm to build spliced/unspliced transcripts.

## Overview

This implementation is written entirely in Rust and takes advantage of three bioinformatics libraries:

1. [`gtftools`](https://github.com/noamteyssier/gtftools/) - For parsing of GTF files
2. [`bedrs`](https://github.com/noamteyssier/bedrs) - For genomic interval arithmetic
3. [`faiquery`](https://github.com/noamteyssier/faiquery/) - For fast querying of indexed FASTA files

## Usage

```bash
splici introns \
    -f \
    -g \
    -o splici.fasta.gz;
```

This generates a splici reference FASTA from the transcripts and exons found in the GTF, querying sequences from the indexed FASTA provided.
The FASTA is expected to be indexed with [`samtools faidx`](http://www.htslib.org/doc/samtools-faidx.html).

### Getting Started

You can download the latest Ensembl DNA and GTF using [`ggetrs ensembl ref`](https://noamteyssier.github.io/ggetrs/ensembl/ref.html):

```bash
ggetrs ensembl ref -D -d dna,gtf
```

Unzip and index the reference DNA:

```bash
gunzip Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
samtools faidx Homo_sapiens.GRCh38.dna.primary_assembly.fa
```

Then run `splici` to generate your splici reference FASTA:

```bash
splici introns \
    -f Homo_sapiens.GRCh38.dna.primary_assembly.fa \
    -g Homo_sapiens.GRCh38.*.gtf.gz \
    -o splici.fasta.gz;
```

## Background

The splici algorithm was described by He et al. (2022); the name is shorthand for *spliced* + *intronic* sequences.
It describes a method to isolate the intronic regions of all incoming transcripts and generate the sequences of both the spliced transcripts and their intronic components.

The algorithm is applied to each gene individually.
First, all transcripts for a gene are identified.
Then, all intronic regions of those transcripts are identified.
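
As a minimal sketch of the intron identification step, the function below treats a transcript's introns as the gaps between its exons. It is plain Rust and does not use the `bedrs` API; the `Interval` alias, the `introns` name, and the half-open `[start, end)` coordinates are assumptions made for illustration, not the types or conventions used by this crate.

```rust
/// A half-open genomic interval `[start, end)` on a single chromosome.
/// (Hypothetical type for illustration; not a `bedrs` type.)
type Interval = (u64, u64);

/// Returns the intronic intervals of one transcript: the gaps left
/// between its exons once the exons are sorted by position.
fn introns(exons: &[Interval]) -> Vec<Interval> {
    // Sort a copy so the caller's exon order does not matter.
    let mut sorted = exons.to_vec();
    sorted.sort_unstable();

    let mut gaps = Vec::new();
    // Rightmost coordinate covered by the exons seen so far.
    let mut covered_to: Option<u64> = None;

    for (start, end) in sorted {
        if let Some(prev_end) = covered_to {
            // Uncovered space before this exon is an intron.
            if start > prev_end {
                gaps.push((prev_end, start));
            }
        }
        // Extend the covered region; overlapping exons produce no gap.
        covered_to = Some(covered_to.map_or(end, |p| p.max(end)));
    }
    gaps
}
```

For exons at `(100, 200)`, `(300, 400)`, and `(500, 600)`, this yields the two introns `(200, 300)` and `(400, 500)`.
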
These intronic regions are defined as the span of each transcript minus its exonic intervals (see [internal](https://docs.rs/bedrs/latest/bedrs/traits/container/trait.Internal.html#method.internal)).

Next, each intronic region is extended by a parameterized amount on both ends, which allows reads to align across junctions between intronic and exonic regions.
Intronic regions from different isoforms generally overlap heavily, so a [merging](https://docs.rs/bedrs/latest/bedrs/traits/container/trait.Merge.html#method.merge) step is performed on the intronic regions to avoid redundant intervals in the final sequences.
Each merged intronic region is then given a unique name and added to the splici reference.

The spliced transcripts are generated by concatenating the exonic intervals of each transcript.
These are named by their transcript ID and added to the splici reference.

## References

1. He, D. et al. Alevin-fry unlocks rapid, accurate and memory-frugal quantification of single-cell RNA-seq data. Nat Methods 19, 316–322 (2022).