scispeak

Crates.ioscispeak
lib.rsscispeak
version0.1.0
sourcesrc
created_at2023-10-11 02:14:46.050426
updated_at2023-10-11 02:14:46.050426
descriptionA tool for converting Sci-RNA-Seq3 files to 10X Genomics compatible FASTQ files.
homepage
repositoryhttps://github.com/noamteyssier/scispeak
max_upload_size
id999690
size185,565
Noam Teyssier (noamteyssier)

documentation

README

scispeak

a rust parser to convert sci-seq-v3 reads into kallisto compatible formats

a CLI tool to whitelist filter sci-seq-v3 reads and convert them to a 10X-style format.

Overview

This tool is used to filter sciseq reads against their respective barcode whitelists and then output fastq file formats in the style of 10X reads.

This parses the sci-seq-v3 format, identifies the cell barcodes and UMIs and writes out a new file to resemble the 10X sequence construct to be used with other tools that have not yet adopted the sci-seq format.

sci-rna-seq3 Sequencing Construct

The sci-rna-seq3 sequencing construct is organized in the following way:

            ┌─'illumina_p5:29'
            ├─'i5:10'
            ├─'truseq_read_1_adapter:33'
            │                            ┌─'hairpin_barcode:10'
            │                            ├─'hairpin_adapter:6'
            ├─read_1─────────────────────┤
            │                            ├─'umi:8'
──RNA───────┤                            └─'cell_bc:10'
            ├─'poly_T:98'
            ├─'read_2:98'
            │                            ┌─'ME:19'
            ├─i7_primer──────────────────┤
            │                            └─'s7:15'
            ├─'i7:10'
            └─'illumina_p7:24'

Visualization from seqspec.

And so the resulting R1 and R2 files boil down to:

# R1
[linker][adapter][umi][barcode]

# R2
[cDNA]

Usage

This is a single command CLI tool. It requires just the R1 and R2 filepaths

scispeak \
    -i data/SRR7827205_sample_R1.fastq.gz \
    -I data/SRR7827205_sample_R2.fastq.gz;

However, it can be accelerated using multiple compression threads:

scispeak \
    -i data/SRR7827205_sample_R1.fastq.gz \
    -I data/SRR7827205_sample_R2.fastq.gz \
    -t 8;

And can store a log file as well to keep matching statistics:

scispeak \
    -i data/SRR7827205_sample_R1.fastq.gz \
    -I data/SRR7827205_sample_R2.fastq.gz \
    -t 8 \
    -l;

Outputs

This program will output 3 files per run:

  1. <args.prefix>_R1.fastq.gz: A fastq with the [barcode][UMI] construct for all reads passing the whitelist.
  2. <args.prefix>_R2.fastq.gz: An unaltered fastq of the R2 for all reads passing the whitelist.
  3. <args.prefix>_log.json: A log file containing the filtering statistics of the run.
Commit count: 11

cargo fmt