selexqc

Crates.ioselexqc
lib.rsselexqc
version0.1.0
created_at2025-10-28 11:52:08.086155+00
updated_at2025-10-28 11:52:08.086155+00
descriptionHigh-performance parallel RNA Capture-SELEX library quality control
homepagehttps://github.com/mulatta/selexqc
repositoryhttps://github.com/mulatta/selexqc
max_upload_size
id1904634
size89,469
mulatta (mulatta)

documentation

https://docs.rs/selexqc

README

selexqc - RNA Capture-SELEX Library Quality Control

High-performance quality control and filtering tool for RNA Capture-SELEX NGS libraries, written in Rust.

License: MIT

Features

  • โšก Fast: Parallel processing with zero-copy parsing using needletail
  • ๐Ÿ“ Format Support: FASTA, FASTQ, and gzipped formats (.fa, .fq, .fq.gz)
  • ๐Ÿ” Case-Insensitive: Constant region matching ignores case
  • ๐Ÿ—๏ธ Structure-Aware: Validates upstream/downstream regions as paired structures
  • ๐Ÿ”€ Flexible Logic: AND (strict) or OR (lenient) validation modes
  • ๐ŸŽฏ Filtering: Save valid sequences to file (FASTA/FASTQ/FASTQ.gz)
  • ๐Ÿ“Š MultiQC Integration: Generate MultiQC-compatible reports
  • ๐Ÿงต Scalable: Multi-threaded processing for large datasets
  • ๐Ÿ’พ Memory Efficient: Streaming processing with minimal memory footprint

Installation

From Source (Cargo)

git clone https://github.com/mulatta/selexqc.git
cd selexqc
cargo build --release
# Binary: target/release/selexqc

Using Nix

nix build
# Binary: result/bin/selexqc

Quick Start

Basic Usage

# Check sequences for constant region presence
selexqc -i library.fq.gz -o results -c TGGCCACCAT

With Structure Validation (Recommended)

# N40:N10 library (40bp upstream + 10bp constant + 10bp downstream)
selexqc \
  -i library.fq.gz \
  -o results \
  -c TGGCCACCAT \
  --upstream-length 40 \
  --upstream-tolerance 2 \
  --downstream-length 10 \
  --downstream-tolerance 1 \
  --filter \
  --multiqc

Usage

selexqc [OPTIONS] --input <FILE> --output <PREFIX> --constant <SEQ>

Required Arguments

Argument Description
-i, --input <FILE> Input sequence file (FASTA/FASTQ, optionally gzipped)
-o, --output <PREFIX> Output file prefix
-c, --constant <SEQ> Constant region sequence (case-insensitive)

Validation Options

Option Default Description
--validation-mode <MODE> and Validation logic: and (strict) or or (lenient)
--min-length <INT> - Minimum total sequence length
--max-length <INT> - Maximum total sequence length
--upstream-length <INT> - Expected upstream length (before constant)
--upstream-tolerance <INT> - Upstream length tolerance (+/-)
--downstream-length <INT> - Expected downstream length (after constant)
--downstream-tolerance <INT> - Downstream length tolerance (+/-)
-q, --min-quality <FLOAT> - Minimum average quality score (FASTQ only)

Filtering Options

Option Default Description
--filter disabled Enable filtering (save valid sequences)
--filter-format <FORMAT> fasta Output format: fasta, fastq, or fastq.gz

Performance & Output

Option Default Description
-t, --threads <INT> 4 Number of threads for parallel processing
-f, --formats <LIST> txt,json Report formats (comma-separated: txt,csv,json)
--multiqc disabled Generate MultiQC-compatible report

Validation Logic

AND Mode (Strict) - Default

ALL criteria must pass for a sequence to be valid:

  • โœ“ Constant region present
  • โœ“ Total length in range (if specified)
  • โœ“ Upstream length correct (if specified)
  • โœ“ Downstream length correct (if specified)
  • โœ“ Quality sufficient (if specified)

Use case: Strict quality control for homogeneous libraries

OR Mode (Lenient)

ANY criterion can pass for a sequence to be valid:

  • โœ“ Constant region present OR
  • โœ“ Total length in range OR
  • โœ“ Upstream/downstream correct OR
  • โœ“ Quality sufficient

Use case: Mixed libraries or exploratory analysis

Structure Pair Validation

When both --upstream-length and --downstream-length are specified:

  • Validates the paired structure: upstream AND downstream together
  • Counts sequences where BOTH regions are within tolerance
  • Reports paired distribution: (upstream, downstream)

Example:

Expected: 40bp upstream + 10bp constant + 10bp downstream = 60bp total
Tolerance: ยฑ2bp upstream, ยฑ1bp downstream

โœ“ Valid: 40bp - TGGCCACCAT - 10bp  (exact match)
โœ“ Valid: 38bp - TGGCCACCAT - 11bp  (within tolerance)
โœ— Invalid: 40bp - TGGCCACCAT - 15bp  (downstream too long)
โœ— Invalid: 35bp - TGGCCACCAT - 10bp  (upstream too short)

Examples

Example 1: N40:N10 Library (Strict Validation)

Expected structure: 40bp variable + 10bp constant + 10bp variable = 60bp total

selexqc \
  -i library.fq.gz \
  -o n40n10_qc \
  -c TGGCCACCAT \
  --validation-mode and \
  --min-length 58 \
  --max-length 62 \
  --upstream-length 40 \
  --upstream-tolerance 2 \
  --downstream-length 10 \
  --downstream-tolerance 1 \
  --filter \
  --filter-format fastq.gz \
  --threads 16 \
  --multiqc

Output files:

  • n40n10_qc.validation.txt - Human-readable report
  • n40n10_qc.stats.json - Complete statistics
  • n40n10_qc.length_dist.csv - Length distribution
  • n40n10_qc.upstream_dist.csv - Upstream distribution
  • n40n10_qc.downstream_dist.csv - Downstream distribution
  • n40n10_qc.structure_pairs.csv - Paired structure distribution
  • n40n10_qc.filtered.fq.gz - Valid sequences only
  • n40n10_qc_mqc.json - MultiQC data

Example 2: N25:N25 Library

selexqc \
  -i library.fq.gz \
  -o n25n25_qc \
  -c TGGCCACCAT \
  --upstream-length 25 \
  --upstream-tolerance 2 \
  --downstream-length 25 \
  --downstream-tolerance 2 \
  --filter \
  --multiqc

Example 3: Mixed Library (OR Mode)

Accept sequences with ANY valid characteristic:

selexqc \
  -i mixed_library.fq.gz \
  -o mixed_qc \
  -c TGGCCACCAT \
  --validation-mode or \
  --min-length 50 \
  --max-length 70

Example 4: Quality Filtering

Filter by quality score (FASTQ only):

selexqc \
  -i raw.fastq.gz \
  -o qfiltered \
  -c TGGCCACCAT \
  --min-quality 30 \
  --filter \
  --filter-format fastq

Example 5: Structure Analysis Without Filtering

Analyze library structure without saving filtered sequences:

selexqc \
  -i library.fa \
  -o analysis \
  -c TGGCCACCAT \
  --upstream-length 40 \
  --downstream-length 10

# Review structure pairs
cat analysis.structure_pairs.csv

Example 6: Complete Pipeline with MultiQC

# Process multiple samples
for sample in sample1 sample2 sample3; do
  selexqc \
    -i ${sample}.fq.gz \
    -o qc/${sample} \
    -c TGGCCACCAT \
    --upstream-length 40 \
    --upstream-tolerance 2 \
    --downstream-length 10 \
    --downstream-tolerance 1 \
    --filter \
    --filter-format fastq.gz \
    --threads 8 \
    --multiqc
done

# Aggregate with MultiQC
multiqc qc/

# Use filtered outputs
nextflow run selex_pipeline \
  --input "qc/*.filtered.fq.gz"

Output Files

Reports

File Format Description
.validation.txt Text Human-readable summary with all statistics
.stats.json JSON Complete statistics (machine-readable)
.length_dist.csv CSV Total sequence length distribution
.upstream_dist.csv CSV Upstream region length distribution
.downstream_dist.csv CSV Downstream region length distribution
.structure_pairs.csv CSV Paired (upstream, downstream) distribution
_mqc.json JSON MultiQC-compatible data

Filtered Sequences

File Format Description
.filtered.fa FASTA Valid sequences (uncompressed)
.filtered.fq FASTQ Valid sequences with quality (uncompressed)
.filtered.fq.gz FASTQ.gz Valid sequences with quality (compressed)

MultiQC Report Contents

  • Total/valid/filtered sequence counts
  • Pass/filter rates
  • Validation mode (AND/OR)
  • Constant region detection rate
  • Structure validation statistics
  • Failure reason breakdown
  • Filter efficiency metrics

Understanding Results

Text Report Example

RNA Capture-SELEX Library Validation Report
======================================================================

Configuration:
  Constant region: TGGCCACCAT
  Validation mode: AND (strict)
  Total length range: 58 - 62 bp
  Expected upstream length: 40 bp (+/- 2)
  Expected downstream length: 10 bp (+/- 1)

Summary Statistics:
  Total sequences: 10471867
  Valid sequences: 9525680 (90.96%)
  Invalid (filtered) sequences: 946187 (9.04%)

Validation Results:
  Constant region present: 10450000 (99.79%)
  Correct total length: 10300000 (98.36%)
  Correct upstream: 9800000 (93.78% of sequences with constant)
  Correct downstream: 10200000 (97.61% of sequences with constant)
  Correct structure (paired): 9525680 (91.15% of sequences with constant)

Failure Reasons:
  Incorrect structure: 850320 (89.87% of invalid sequences)
  Incorrect total length: 171867 (18.16% of invalid sequences)
  Missing constant region: 21867 (2.31% of invalid sequences)

Structure Pair Distribution:
  (40, 10): 9000000 sequences (86.12%)
  (39, 11): 300000 sequences ( 2.87%)
  (41,  9): 225680 sequences ( 2.16%)
  ...

Structure Pair Distribution

Shows how many sequences have each (upstream, downstream) combination:

upstream_length,downstream_length,count,percentage
40,10,9000000,86.12
39,11,300000,2.87
41,9,225680,2.16
38,12,150000,1.43

This helps identify:

  • Most common library structure
  • Distribution of variants
  • Potential issues with library preparation

Performance

Optimizations:

  • Zero-copy parsing with needletail
  • Streaming processing (10K sequence chunks)
  • Lock-free atomic counters
  • Fast substring search with Boyer-Moore (via memchr)

Tips & Best Practices

Determining Optimal Parameters

  1. Initial exploration (no structure validation):
selexqc -i library.fq.gz -o explore -c TGGCCACCAT
cat explore.validation.txt  # Review distributions
  1. Check distributions in the report to see actual upstream/downstream lengths

  2. Set tolerances based on observed distribution:

    • 95% within range โ†’ use that as tolerance
    • Large variation โ†’ consider OR mode or wider tolerance

Recommended Workflows

For homogeneous libraries (N40:N10, N25:N25):

  • Use AND mode (strict)
  • Set structure validation
  • Enable filtering
  • Use filtered output for downstream analysis

For mixed or exploratory libraries:

  • Start with OR mode
  • Review structure pairs distribution
  • Refine parameters based on results

For quality control in pipelines:

  • Enable MultiQC output
  • Use consistent parameters across samples
  • Compare QC metrics between samples

Batch Processing

# Process all samples in parallel (GNU parallel)
ls *.fq.gz | parallel -j 4 \
  'selexqc -i {} -o qc/{/.} -c TGGCCACCAT \
   --upstream-length 40 --downstream-length 10 \
   --filter --multiqc'

# Aggregate results
multiqc qc/

Troubleshooting

No Valid Sequences Found

Check:

  1. Constant region sequence is correct
  2. Constant region is in the same format (DNA/RNA, not complement)
  3. Review failure statistics in .validation.txt

Debug:

# Verify constant region is present
grep -c "TGGCCACCAT" library.fa

# Check if case-sensitive issue (shouldn't be, but verify)
grep -i "tggccaccat" library.fa | head

Low Structure Match Rate

If paired structure validation fails:

  1. Check structure_pairs.csv for actual distribution
  2. If distribution shows near-misses, increase tolerances
  3. Consider if library actually has mixed structures (use OR mode)

Example:

# Check top structure pairs
head -20 results.structure_pairs.csv

# If you see (38,12) and (42,8) frequently, consider:
# - Increasing tolerance, OR
# - Using OR mode, OR
# - Library has multiple intended structures

Memory Issues

For very large files (>100M reads):

# Reduce thread count to lower memory usage
selexqc -i huge.fq.gz -o output -c CONST --threads 4

# Or process in chunks (external tool)
split -l 40000000 huge.fq huge_chunk_
for chunk in huge_chunk_*; do
  selexqc -i $chunk -o qc/$(basename $chunk) -c CONST
done

Technical Details

Dependencies

  • needletail - Fast FASTA/FASTQ parsing
  • rayon - Data parallelism
  • memchr - Fast substring search (Boyer-Moore-Horspool)
  • flate2 - Gzip compression/decompression
  • serde / serde_json - Serialization
  • csv - CSV writing

Algorithms

Constant Region Search:

  • Boyer-Moore-Horspool via memchr::memmem
  • O(n) average case, O(nm) worst case
  • SIMD-accelerated when available

Parallel Processing:

  • Chunk-based processing (10K sequences per chunk)
  • Rayon work-stealing thread pool
  • Atomic counters for lock-free statistics
  • Minimal memory overhead

Quality Calculation (FASTQ):

  • Phred quality: Q = -10 * log10(P)
  • ASCII conversion: Q = ASCII - 33
  • Average across all positions

Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests if applicable
  5. Submit a pull request

License

MIT License - see LICENSE file for details

Citation

If you use selexqc in your research, please cite:

selexqc: High-performance quality control for RNA Capture-SELEX libraries
https://github.com/mulatta/selexqc

Support

Changelog

v0.1.0 (Initial Release)

  • Fast parallel processing of FASTA/FASTQ/FASTQ.gz
  • Case-insensitive constant region detection
  • Structure validation (upstream/downstream pairs)
  • AND/OR validation modes
  • Filtering with multiple output formats
  • MultiQC integration
  • Comprehensive reporting (TXT/CSV/JSON)
Commit count: 0

cargo fmt