selexqc

Crates.io	selexqc
lib.rs	selexqc
version	0.1.0
created_at	2025-10-28 11:52:08.086155+00
updated_at	2025-10-28 11:52:08.086155+00
description	High-performance parallel RNA Capture-SELEX library quality control
homepage	https://github.com/mulatta/selexqc
repository	https://github.com/mulatta/selexqc
max_upload_size
id	1904634
size	89,469

mulatta (mulatta)

documentation

https://docs.rs/selexqc

README

selexqc - RNA Capture-SELEX Library Quality Control

High-performance quality control and filtering tool for RNA Capture-SELEX NGS libraries, written in Rust.

Features

⚡ Fast: Parallel processing with zero-copy parsing using needletail
📁 Format Support: FASTA, FASTQ, and gzipped formats (.fa, .fq, .fq.gz)
🔍 Case-Insensitive: Constant region matching ignores case
🏗️ Structure-Aware: Validates upstream/downstream regions as paired structures
🔀 Flexible Logic: AND (strict) or OR (lenient) validation modes
🎯 Filtering: Save valid sequences to file (FASTA/FASTQ/FASTQ.gz)
📊 MultiQC Integration: Generate MultiQC-compatible reports
🧵 Scalable: Multi-threaded processing for large datasets
💾 Memory Efficient: Streaming processing with minimal memory footprint

Installation

From Source (Cargo)

git clone https://github.com/mulatta/selexqc.git
cd selexqc
cargo build --release
# Binary: target/release/selexqc

Using Nix

nix build
# Binary: result/bin/selexqc

Quick Start

Basic Usage

# Check sequences for constant region presence
selexqc -i library.fq.gz -o results -c TGGCCACCAT

With Structure Validation (Recommended)

# N40:N10 library (40bp upstream + 10bp constant + 10bp downstream)
selexqc \
  -i library.fq.gz \
  -o results \
  -c TGGCCACCAT \
  --upstream-length 40 \
  --upstream-tolerance 2 \
  --downstream-length 10 \
  --downstream-tolerance 1 \
  --filter \
  --multiqc

Usage

selexqc [OPTIONS] --input <FILE> --output <PREFIX> --constant <SEQ>

Required Arguments

Argument	Description
`-i, --input <FILE>`	Input sequence file (FASTA/FASTQ, optionally gzipped)
`-o, --output <PREFIX>`	Output file prefix
`-c, --constant <SEQ>`	Constant region sequence (case-insensitive)

Validation Options

Option	Default	Description
`--validation-mode <MODE>`	`and`	Validation logic: and (strict) or or (lenient)
`--min-length <INT>`	-	Minimum total sequence length
`--max-length <INT>`	-	Maximum total sequence length
`--upstream-length <INT>`	-	Expected upstream length (before constant)
`--upstream-tolerance <INT>`	-	Upstream length tolerance (+/-)
`--downstream-length <INT>`	-	Expected downstream length (after constant)
`--downstream-tolerance <INT>`	-	Downstream length tolerance (+/-)
`-q, --min-quality <FLOAT>`	-	Minimum average quality score (FASTQ only)

Filtering Options

Option	Default	Description
`--filter`	disabled	Enable filtering (save valid sequences)
`--filter-format <FORMAT>`	`fasta`	Output format: fasta, fastq, or fastq.gz

Performance & Output

Option	Default	Description
`-t, --threads <INT>`	`4`	Number of threads for parallel processing
`-f, --formats <LIST>`	`txt,json`	Report formats (comma-separated: txt,csv,json)
`--multiqc`	disabled	Generate MultiQC-compatible report

Validation Logic

AND Mode (Strict) - Default

ALL criteria must pass for a sequence to be valid:

✓ Constant region present
✓ Total length in range (if specified)
✓ Upstream length correct (if specified)
✓ Downstream length correct (if specified)
✓ Quality sufficient (if specified)

Use case: Strict quality control for homogeneous libraries

OR Mode (Lenient)

ANY criterion can pass for a sequence to be valid:

✓ Constant region present OR
✓ Total length in range OR
✓ Upstream/downstream correct OR
✓ Quality sufficient

Use case: Mixed libraries or exploratory analysis

Structure Pair Validation

When both --upstream-length and --downstream-length are specified:

Validates the paired structure: upstream AND downstream together
Counts sequences where BOTH regions are within tolerance
Reports paired distribution: (upstream, downstream)

Example:

Expected: 40bp upstream + 10bp constant + 10bp downstream = 60bp total
Tolerance: ±2bp upstream, ±1bp downstream

✓ Valid: 40bp - TGGCCACCAT - 10bp  (exact match)
✓ Valid: 38bp - TGGCCACCAT - 11bp  (within tolerance)
✗ Invalid: 40bp - TGGCCACCAT - 15bp  (downstream too long)
✗ Invalid: 35bp - TGGCCACCAT - 10bp  (upstream too short)

Examples

Example 1: N40:N10 Library (Strict Validation)

Expected structure: 40bp variable + 10bp constant + 10bp variable = 60bp total

selexqc \
  -i library.fq.gz \
  -o n40n10_qc \
  -c TGGCCACCAT \
  --validation-mode and \
  --min-length 58 \
  --max-length 62 \
  --upstream-length 40 \
  --upstream-tolerance 2 \
  --downstream-length 10 \
  --downstream-tolerance 1 \
  --filter \
  --filter-format fastq.gz \
  --threads 16 \
  --multiqc

Output files:

n40n10_qc.validation.txt - Human-readable report
n40n10_qc.stats.json - Complete statistics
n40n10_qc.length_dist.csv - Length distribution
n40n10_qc.upstream_dist.csv - Upstream distribution
n40n10_qc.downstream_dist.csv - Downstream distribution
n40n10_qc.structure_pairs.csv - Paired structure distribution
n40n10_qc.filtered.fq.gz - Valid sequences only
n40n10_qc_mqc.json - MultiQC data

Example 2: N25:N25 Library

selexqc \
  -i library.fq.gz \
  -o n25n25_qc \
  -c TGGCCACCAT \
  --upstream-length 25 \
  --upstream-tolerance 2 \
  --downstream-length 25 \
  --downstream-tolerance 2 \
  --filter \
  --multiqc

Example 3: Mixed Library (OR Mode)

Accept sequences with ANY valid characteristic:

selexqc \
  -i mixed_library.fq.gz \
  -o mixed_qc \
  -c TGGCCACCAT \
  --validation-mode or \
  --min-length 50 \
  --max-length 70

Example 4: Quality Filtering

Filter by quality score (FASTQ only):

selexqc \
  -i raw.fastq.gz \
  -o qfiltered \
  -c TGGCCACCAT \
  --min-quality 30 \
  --filter \
  --filter-format fastq

Example 5: Structure Analysis Without Filtering

Analyze library structure without saving filtered sequences:

selexqc \
  -i library.fa \
  -o analysis \
  -c TGGCCACCAT \
  --upstream-length 40 \
  --downstream-length 10

# Review structure pairs
cat analysis.structure_pairs.csv

Example 6: Complete Pipeline with MultiQC

# Process multiple samples
for sample in sample1 sample2 sample3; do
  selexqc \
    -i ${sample}.fq.gz \
    -o qc/${sample} \
    -c TGGCCACCAT \
    --upstream-length 40 \
    --upstream-tolerance 2 \
    --downstream-length 10 \
    --downstream-tolerance 1 \
    --filter \
    --filter-format fastq.gz \
    --threads 8 \
    --multiqc
done

# Aggregate with MultiQC
multiqc qc/

# Use filtered outputs
nextflow run selex_pipeline \
  --input "qc/*.filtered.fq.gz"

Output Files

Reports

File	Format	Description
`.validation.txt`	Text	Human-readable summary with all statistics
`.stats.json`	JSON	Complete statistics (machine-readable)
`.length_dist.csv`	CSV	Total sequence length distribution
`.upstream_dist.csv`	CSV	Upstream region length distribution
`.downstream_dist.csv`	CSV	Downstream region length distribution
`.structure_pairs.csv`	CSV	Paired (upstream, downstream) distribution
`_mqc.json`	JSON	MultiQC-compatible data

Filtered Sequences

File	Format	Description
`.filtered.fa`	FASTA	Valid sequences (uncompressed)
`.filtered.fq`	FASTQ	Valid sequences with quality (uncompressed)
`.filtered.fq.gz`	FASTQ.gz	Valid sequences with quality (compressed)

MultiQC Report Contents

Total/valid/filtered sequence counts
Pass/filter rates
Validation mode (AND/OR)
Constant region detection rate
Structure validation statistics
Failure reason breakdown
Filter efficiency metrics

Understanding Results

Text Report Example

RNA Capture-SELEX Library Validation Report
======================================================================

Configuration:
  Constant region: TGGCCACCAT
  Validation mode: AND (strict)
  Total length range: 58 - 62 bp
  Expected upstream length: 40 bp (+/- 2)
  Expected downstream length: 10 bp (+/- 1)

Summary Statistics:
  Total sequences: 10471867
  Valid sequences: 9525680 (90.96%)
  Invalid (filtered) sequences: 946187 (9.04%)

Validation Results:
  Constant region present: 10450000 (99.79%)
  Correct total length: 10300000 (98.36%)
  Correct upstream: 9800000 (93.78% of sequences with constant)
  Correct downstream: 10200000 (97.61% of sequences with constant)
  Correct structure (paired): 9525680 (91.15% of sequences with constant)

Failure Reasons:
  Incorrect structure: 850320 (89.87% of invalid sequences)
  Incorrect total length: 171867 (18.16% of invalid sequences)
  Missing constant region: 21867 (2.31% of invalid sequences)

Structure Pair Distribution:
  (40, 10): 9000000 sequences (86.12%)
  (39, 11): 300000 sequences ( 2.87%)
  (41,  9): 225680 sequences ( 2.16%)
  ...

Structure Pair Distribution

Shows how many sequences have each (upstream, downstream) combination:

upstream_length,downstream_length,count,percentage
40,10,9000000,86.12
39,11,300000,2.87
41,9,225680,2.16
38,12,150000,1.43

This helps identify:

Most common library structure
Distribution of variants
Potential issues with library preparation

Performance

Optimizations:

Zero-copy parsing with needletail
Streaming processing (10K sequence chunks)
Lock-free atomic counters
Fast substring search with Boyer-Moore (via memchr)

Tips & Best Practices

Determining Optimal Parameters

Initial exploration (no structure validation):

selexqc -i library.fq.gz -o explore -c TGGCCACCAT
cat explore.validation.txt  # Review distributions

Check distributions in the report to see actual upstream/downstream lengths
Set tolerances based on observed distribution:
- 95% within range → use that as tolerance
- Large variation → consider OR mode or wider tolerance

Recommended Workflows

For homogeneous libraries (N40:N10, N25:N25):

Use AND mode (strict)
Set structure validation
Enable filtering
Use filtered output for downstream analysis

For mixed or exploratory libraries:

Start with OR mode
Review structure pairs distribution
Refine parameters based on results

For quality control in pipelines:

Enable MultiQC output
Use consistent parameters across samples
Compare QC metrics between samples

Batch Processing

# Process all samples in parallel (GNU parallel)
ls *.fq.gz | parallel -j 4 \
  'selexqc -i {} -o qc/{/.} -c TGGCCACCAT \
   --upstream-length 40 --downstream-length 10 \
   --filter --multiqc'

# Aggregate results
multiqc qc/

Troubleshooting

No Valid Sequences Found

Check:

Constant region sequence is correct
Constant region is in the same format (DNA/RNA, not complement)
Review failure statistics in .validation.txt

Debug:

# Verify constant region is present
grep -c "TGGCCACCAT" library.fa

# Check if case-sensitive issue (shouldn't be, but verify)
grep -i "tggccaccat" library.fa | head

Low Structure Match Rate

If paired structure validation fails:

Check structure_pairs.csv for actual distribution
If distribution shows near-misses, increase tolerances
Consider if library actually has mixed structures (use OR mode)