| Crates.io | selexqc |
| lib.rs | selexqc |
| version | 0.1.0 |
| created_at | 2025-10-28 11:52:08.086155+00 |
| updated_at | 2025-10-28 11:52:08.086155+00 |
| description | High-performance parallel RNA Capture-SELEX library quality control |
| homepage | https://github.com/mulatta/selexqc |
| repository | https://github.com/mulatta/selexqc |
| id | 1904634 |
| size | 89,469 |
High-performance quality control and filtering tool for RNA Capture-SELEX NGS libraries, written in Rust.
Sequence files are parsed with needletail; supported inputs are FASTA and FASTQ, optionally gzipped (.fa, .fq, .fq.gz).
git clone https://github.com/mulatta/selexqc.git
cd selexqc
cargo build --release
# Binary: target/release/selexqc
nix build
# Binary: result/bin/selexqc
# Check sequences for constant region presence
selexqc -i library.fq.gz -o results -c TGGCCACCAT
# N40:N10 library (40bp upstream + 10bp constant + 10bp downstream)
selexqc \
-i library.fq.gz \
-o results \
-c TGGCCACCAT \
--upstream-length 40 \
--upstream-tolerance 2 \
--downstream-length 10 \
--downstream-tolerance 1 \
--filter \
--multiqc
selexqc [OPTIONS] --input <FILE> --output <PREFIX> --constant <SEQ>
| Argument | Description |
|---|---|
| `-i, --input <FILE>` | Input sequence file (FASTA/FASTQ, optionally gzipped) |
| `-o, --output <PREFIX>` | Output file prefix |
| `-c, --constant <SEQ>` | Constant region sequence (case-insensitive) |
| Option | Default | Description |
|---|---|---|
| `--validation-mode <MODE>` | `and` | Validation logic: `and` (strict) or `or` (lenient) |
| `--min-length <INT>` | - | Minimum total sequence length |
| `--max-length <INT>` | - | Maximum total sequence length |
| `--upstream-length <INT>` | - | Expected upstream length (before constant) |
| `--upstream-tolerance <INT>` | - | Upstream length tolerance (+/-) |
| `--downstream-length <INT>` | - | Expected downstream length (after constant) |
| `--downstream-tolerance <INT>` | - | Downstream length tolerance (+/-) |
| `-q, --min-quality <FLOAT>` | - | Minimum average quality score (FASTQ only) |
| Option | Default | Description |
|---|---|---|
| `--filter` | disabled | Enable filtering (save valid sequences) |
| `--filter-format <FORMAT>` | `fasta` | Output format: `fasta`, `fastq`, or `fastq.gz` |
| Option | Default | Description |
|---|---|---|
| `-t, --threads <INT>` | 4 | Number of threads for parallel processing |
| `-f, --formats <LIST>` | `txt,json` | Report formats (comma-separated: `txt,csv,json`) |
| `--multiqc` | disabled | Generate MultiQC-compatible report |
AND mode (--validation-mode and, the default): ALL criteria must pass for a sequence to be valid.
Use case: Strict quality control for homogeneous libraries
OR mode (--validation-mode or): ANY criterion can pass for a sequence to be valid.
Use case: Mixed libraries or exploratory analysis
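The difference between the two modes can be sketched in a few lines of shell. This is an illustration only, not selexqc's implementation: `combine_checks` is a hypothetical helper, and it assumes each enabled criterion has already been reduced to 1 (pass) or 0 (fail).

```shell
# Hypothetical helper, not part of selexqc.
# Usage: combine_checks MODE RESULT...   (each RESULT is 1 = pass, 0 = fail)
# Exit status 0 means "valid" under the chosen mode.
combine_checks() (
    mode=$1
    shift
    passed=0
    total=0
    for result in "$@"; do
        total=$((total + 1))
        if [ "$result" -eq 1 ]; then
            passed=$((passed + 1))
        fi
    done
    case $mode in
        and) [ "$total" -gt 0 ] && [ "$passed" -eq "$total" ] ;;  # every check passed
        or)  [ "$passed" -ge 1 ] ;;                               # at least one passed
        *)   echo "unknown mode: $mode" >&2; exit 2 ;;
    esac
)

# Same per-criterion results, different verdicts:
if combine_checks and 1 1 0; then echo "and: valid"; else echo "and: invalid"; fi
if combine_checks or 1 1 0; then echo "or: valid"; else echo "or: invalid"; fi
```

With one failing criterion out of three, the first call reports `and: invalid` and the second reports `or: valid`.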
When both --upstream-length and --downstream-length are specified, each sequence is validated as an (upstream, downstream) pair.
Example:
Expected: 40bp upstream + 10bp constant + 10bp downstream = 60bp total
Tolerance: ±2bp upstream, ±1bp downstream
✓ Valid: 40bp - TGGCCACCAT - 10bp (exact match)
✓ Valid: 38bp - TGGCCACCAT - 11bp (within tolerance)
✗ Invalid: 40bp - TGGCCACCAT - 15bp (downstream too long)
✗ Invalid: 35bp - TGGCCACCAT - 10bp (upstream too short)
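The paired check in the example above can be reproduced with plain shell string operations. The sketch below is an assumption about the logic, not selexqc's code: `repeat_base`, `within_tolerance`, and `check_structure` are hypothetical helpers, and the read is split around the first occurrence of the constant region, which may differ from how the tool picks a match.

```shell
# Illustration only; selexqc's actual matching rules may differ.
# Build N copies of one base for the demo reads below.
repeat_base() {
    awk -v n="$1" -v b="$2" 'BEGIN { while (n-- > 0) printf "%s", b }'
}

# Usage: within_tolerance OBSERVED EXPECTED TOLERANCE
within_tolerance() (
    diff=$(( $1 - $2 ))
    [ "${diff#-}" -le "$3" ]        # ${diff#-} drops the sign, giving |diff|
)

# Usage: check_structure READ CONST UP_LEN UP_TOL DOWN_LEN DOWN_TOL
check_structure() (
    read_seq=$(printf '%s' "$1" | tr '[:lower:]' '[:upper:]')
    const=$(printf '%s' "$2" | tr '[:lower:]' '[:upper:]')
    case $read_seq in
        *"$const"*) ;;
        *) echo "missing constant"; exit 1 ;;
    esac
    upstream=${read_seq%%"$const"*}     # everything before the first match
    downstream=${read_seq#*"$const"}    # everything after the first match
    up=${#upstream}
    down=${#downstream}
    if within_tolerance "$up" "$3" "$4" && within_tolerance "$down" "$5" "$6"; then
        echo "valid upstream=$up downstream=$down"
    else
        echo "invalid upstream=$up downstream=$down"
        exit 1
    fi
)

ok=$(repeat_base 38 A)TGGCCACCAT$(repeat_base 11 C)
bad=$(repeat_base 40 A)TGGCCACCAT$(repeat_base 15 C)
check_structure "$ok" TGGCCACCAT 40 2 10 1           # valid upstream=38 downstream=11
check_structure "$bad" TGGCCACCAT 40 2 10 1 || true  # invalid upstream=40 downstream=15
```

Both flanks must fall inside their own window, so a read with a perfect upstream region still fails when the downstream region is five bases too long.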
Expected structure: 40bp variable + 10bp constant + 10bp variable = 60bp total
selexqc \
-i library.fq.gz \
-o n40n10_qc \
-c TGGCCACCAT \
--validation-mode and \
--min-length 58 \
--max-length 62 \
--upstream-length 40 \
--upstream-tolerance 2 \
--downstream-length 10 \
--downstream-tolerance 1 \
--filter \
--filter-format fastq.gz \
--threads 16 \
--multiqc
Output files:
- n40n10_qc.validation.txt - Human-readable report
- n40n10_qc.stats.json - Complete statistics
- n40n10_qc.length_dist.csv - Length distribution
- n40n10_qc.upstream_dist.csv - Upstream distribution
- n40n10_qc.downstream_dist.csv - Downstream distribution
- n40n10_qc.structure_pairs.csv - Paired structure distribution
- n40n10_qc.filtered.fq.gz - Valid sequences only
- n40n10_qc_mqc.json - MultiQC data
Expected structure: 25bp variable + 10bp constant + 25bp variable = 60bp total
selexqc \
-i library.fq.gz \
-o n25n25_qc \
-c TGGCCACCAT \
--upstream-length 25 \
--upstream-tolerance 2 \
--downstream-length 25 \
--downstream-tolerance 2 \
--filter \
--multiqc
Accept sequences with ANY valid characteristic:
selexqc \
-i mixed_library.fq.gz \
-o mixed_qc \
-c TGGCCACCAT \
--validation-mode or \
--min-length 50 \
--max-length 70
Filter by quality score (FASTQ only):
selexqc \
-i raw.fastq.gz \
-o qfiltered \
-c TGGCCACCAT \
--min-quality 30 \
--filter \
--filter-format fastq
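To see what an average quality threshold means for a single read, the sketch below computes a mean Phred score from a FASTQ quality string. It rests on two assumptions that the text above does not confirm for selexqc: Phred+33 encoding, and a plain arithmetic mean of the per-base scores. `mean_phred` is a hypothetical helper name.

```shell
# Sketch under stated assumptions (Phred+33, arithmetic mean of scores).
# Usage: mean_phred QUALITY_STRING
mean_phred() {
    printf '%s' "$1" |
        od -An -v -tu1 |
        awk '{ for (i = 1; i <= NF; i++) { sum += $i - 33; n++ } }
             END { if (n) printf "%.2f\n", sum / n }'
}

mean_phred 'IIIIIIII'   # 40.00 ('I' is ASCII 73, so Q40)
mean_phred 'II55'       # 30.00 (two Q40 and two Q20 bases)
```

`od` turns each quality character into its byte value, and `awk` subtracts the offset of 33 before averaging.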
Analyze library structure without saving filtered sequences:
selexqc \
-i library.fa \
-o analysis \
-c TGGCCACCAT \
--upstream-length 40 \
--downstream-length 10
# Review structure pairs
cat analysis.structure_pairs.csv
# Process multiple samples
for sample in sample1 sample2 sample3; do
selexqc \
-i "${sample}.fq.gz" \
-o "qc/${sample}" \
-c TGGCCACCAT \
--upstream-length 40 \
--upstream-tolerance 2 \
--downstream-length 10 \
--downstream-tolerance 1 \
--filter \
--filter-format fastq.gz \
--threads 8 \
--multiqc
done
# Aggregate with MultiQC
multiqc qc/
# Use filtered outputs
nextflow run selex_pipeline \
--input "qc/*.filtered.fq.gz"
| File | Format | Description |
|---|---|---|
| `.validation.txt` | Text | Human-readable summary with all statistics |
| `.stats.json` | JSON | Complete statistics (machine-readable) |
| `.length_dist.csv` | CSV | Total sequence length distribution |
| `.upstream_dist.csv` | CSV | Upstream region length distribution |
| `.downstream_dist.csv` | CSV | Downstream region length distribution |
| `.structure_pairs.csv` | CSV | Paired (upstream, downstream) distribution |
| `_mqc.json` | JSON | MultiQC-compatible data |
| File | Format | Description |
|---|---|---|
| `.filtered.fa` | FASTA | Valid sequences (uncompressed) |
| `.filtered.fq` | FASTQ | Valid sequences with quality (uncompressed) |
| `.filtered.fq.gz` | FASTQ.gz | Valid sequences with quality (compressed) |
RNA Capture-SELEX Library Validation Report
======================================================================
Configuration:
Constant region: TGGCCACCAT
Validation mode: AND (strict)
Total length range: 58 - 62 bp
Expected upstream length: 40 bp (+/- 2)
Expected downstream length: 10 bp (+/- 1)
Summary Statistics:
Total sequences: 10471867
Valid sequences: 9525680 (90.96%)
Invalid (filtered) sequences: 946187 (9.04%)
Validation Results:
Constant region present: 10450000 (99.79%)
Correct total length: 10300000 (98.36%)
Correct upstream: 9800000 (93.78% of sequences with constant)
Correct downstream: 10200000 (97.61% of sequences with constant)
Correct structure (paired): 9525680 (91.15% of sequences with constant)
Failure Reasons:
Incorrect structure: 850320 (89.87% of invalid sequences)
Incorrect total length: 171867 (18.16% of invalid sequences)
Missing constant region: 21867 (2.31% of invalid sequences)
Structure Pair Distribution:
(40, 10): 9000000 sequences (86.12%)
(39, 11): 300000 sequences ( 2.87%)
(41, 9): 225680 sequences ( 2.16%)
...
Shows how many sequences have each (upstream, downstream) combination:
upstream_length,downstream_length,count,percentage
40,10,9000000,86.12
39,11,300000,2.87
41,9,225680,2.16
38,12,150000,1.43
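This helps identify which (upstream, downstream) combinations dominate the library and which fall just outside the configured tolerances.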
Optimizations:
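- Fast FASTA/FASTQ parsing (needletail)
- Fast constant region search (memchr)
Start with an exploratory run that specifies only the constant region:
selexqc -i library.fq.gz -o explore -c TGGCCACCAT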
cat explore.validation.txt # Review distributions
Check distributions in the report to see actual upstream/downstream lengths
Set tolerances based on observed distribution:
For homogeneous libraries (N40:N10, N25:N25): use AND mode (strict).
For mixed or exploratory libraries: use OR mode (lenient).
For quality control in pipelines:
# Process all samples in parallel (GNU parallel)
ls *.fq.gz | parallel -j 4 \
'selexqc -i {} -o qc/{/.} -c TGGCCACCAT \
--upstream-length 40 --downstream-length 10 \
--filter --multiqc'
# Aggregate results
multiqc qc/
Check:
- The report in .validation.txt
Debug:
# Verify constant region is present
grep -c "TGGCCACCAT" library.fa
# Check if case-sensitive issue (shouldn't be, but verify)
grep -i "tggccaccat" library.fa | head
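The `grep` commands above count matching lines in a FASTA file. For FASTQ input, a rough per-read hit rate can be computed by looking only at sequence lines. This is a hedged sketch: `constant_hit_rate` is a hypothetical helper, and it assumes plain four-line FASTQ records with unwrapped sequences.

```shell
# Usage: constant_hit_rate CONST < reads.fastq
# Assumes plain 4-line FASTQ records (the sequence is line 2 of every 4).
constant_hit_rate() {
    awk -v c="$1" '
        BEGIN { c = toupper(c) }
        NR % 4 == 2 {
            n++
            if (index(toupper($0), c) > 0) hits++
        }
        END { if (n) printf "%d of %d reads (%.2f%%)\n", hits, n, 100 * hits / n }
    '
}

# For a gzipped library: gzip -dc library.fq.gz | constant_hit_rate TGGCCACCAT
printf '@r1\nAAtggccaccatCC\n+\nIIIIIIIIIIIIII\n@r2\nACGTACGT\n+\nIIIIIIII\n' |
    constant_hit_rate TGGCCACCAT   # 1 of 2 reads (50.00%)
```

Both the read and the constant are upper-cased before comparison, so the lowercase match in the first demo record still counts.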
If paired structure validation fails:
- Check structure_pairs.csv for actual distribution
Example:
# Check top structure pairs
head -20 results.structure_pairs.csv
# If you see (38,12) and (42,8) frequently, consider:
# - Increasing tolerance, OR
# - Using OR mode, OR
# - Library has multiple intended structures
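Before widening a tolerance, it can help to estimate how many sequences a candidate window would accept. The sketch below does this from the `structure_pairs.csv` columns shown earlier (`upstream_length,downstream_length,count,percentage`). `share_within_tolerance` is a hypothetical helper, and the percentage it prints is relative to the rows in the file, not to the whole library.

```shell
# Usage: share_within_tolerance UP_LEN UP_TOL DOWN_LEN DOWN_TOL < PREFIX.structure_pairs.csv
share_within_tolerance() {
    awk -F, -v ul="$1" -v ut="$2" -v dl="$3" -v dt="$4" '
        function abs(x) { return x < 0 ? -x : x }
        NR == 1 { next }                                  # skip the header row
        { total += $3 }
        abs($1 - ul) <= ut && abs($2 - dl) <= dt { kept += $3 }
        END {
            if (total) printf "%d of %d sequences (%.2f%%)\n", kept, total, 100 * kept / total
        }
    '
}

# Rows copied from the sample structure_pairs.csv shown earlier:
share_within_tolerance 40 2 10 1 <<'EOF'
upstream_length,downstream_length,count,percentage
40,10,9000000,86.12
39,11,300000,2.87
41,9,225680,2.16
38,12,150000,1.43
EOF
```

On these four rows the call prints `9525680 of 9675680 sequences (98.45%)`: the (38, 12) pair is excluded because its downstream length is two bases off. Rerunning with a downstream tolerance of 2 shows what relaxing that one setting would recover.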
For very large files (>100M reads):
# Reduce thread count to lower memory usage
selexqc -i huge.fq.gz -o output -c CONST --threads 4
# Or process in chunks (external tool)
split -l 40000000 huge.fq huge_chunk_
for chunk in huge_chunk_*; do
selexqc -i "$chunk" -o "qc/$(basename "$chunk")" -c CONST
done
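Chunked runs produce one set of reports per chunk, so the per-chunk distributions need to be combined afterwards. The sketch below merges several `structure_pairs.csv` files by summing counts per (upstream, downstream) pair. It assumes each chunk run actually wrote such a file with the column layout shown earlier, and it recomputes the percentage against the merged row total. `merge_structure_pairs` is a hypothetical helper, not a selexqc feature.

```shell
# Usage: merge_structure_pairs qc/huge_chunk_*.structure_pairs.csv > merged.structure_pairs.csv
merge_structure_pairs() {
    echo "upstream_length,downstream_length,count,percentage"
    awk -F, '
        FNR == 1 { next }                       # skip the header of every file
        { key = $1 "," $2; count[key] += $3; total += $3 }
        END {
            for (key in count)
                printf "%s,%d,%.2f\n", key, count[key], 100 * count[key] / total
        }
    ' "$@" | sort -t, -k3,3nr                   # largest pair first
}
```

Summing raw counts and recomputing percentages avoids averaging the per-chunk percentages, which would be wrong whenever chunks differ in size.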
- needletail - Fast FASTA/FASTQ parsing
- rayon - Data parallelism
- memchr - Fast substring search (SIMD-accelerated)
- flate2 - Gzip compression/decompression
- serde / serde_json - Serialization
- csv - CSV writing
Constant Region Search:
- memchr::memmem
Parallel Processing:
Quality Calculation (FASTQ):
Contributions are welcome!
MIT License - see LICENSE file for details
If you use selexqc in your research, please cite:
selexqc: High-performance quality control for RNA Capture-SELEX libraries
https://github.com/mulatta/selexqc