| Crates.io | seqtable |
| lib.rs | seqtable |
| version | 0.1.1 |
| created_at | 2025-10-30 02:38:38.245422+00 |
| updated_at | 2025-10-30 10:46:07.666764+00 |
| description | High-performance parallel FASTA/FASTQ sequence counter |
| homepage | https://github.com/mulatta/seqtable |
| repository | https://github.com/mulatta/seqtable |
| max_upload_size | |
| id | 1907629 |
| size | 85,489 |
🧬 High-performance parallel FASTA/FASTQ sequence counter with multiple output formats
Reads plain and gzip-compressed (.gz) FASTA/FASTQ files.
# Install from this repository
nix profile install github:mulatta/seqtable
# Or run directly
nix run github:mulatta/seqtable -- --help
git clone https://github.com/mulatta/seqtable
cd seqtable
cargo build --release
./target/release/seqtable --help
# Count sequences in a FASTQ file
seqtable input.fastq.gz
# Specify output directory
seqtable input.fastq.gz -o results/
# Use CSV format with RPM
seqtable input.fastq.gz -f csv --rpm
Use GNU parallel to process multiple files; a sketch for combining the per-file results follows the examples below:
# Process all FASTQ files in parallel (4 jobs)
parallel -j 4 seqtable {} -o results/ ::: *.fastq.gz
# Memory-aware processing
parallel --memfree 4G seqtable {} -o results/ ::: *.fq.gz
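The per-file count tables can then be combined for downstream comparison. A minimal polars sketch, assuming the default Parquet format and `_counts` suffix (the file paths and the added `sample` column are illustrative, not part of seqtable's output):
import glob
from pathlib import Path
import polars as pl

frames = []
for path in glob.glob("results/*_counts.parquet"):
    # Tag each table with the sample it came from (derived from the file name)
    sample = Path(path).name.removesuffix("_counts.parquet")
    frames.append(pl.read_parquet(path).with_columns(pl.lit(sample).alias("sample")))

combined = pl.concat(frames)
print(combined.head())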
seqtable [OPTIONS] <INPUT>...
Arguments:
<INPUT>... Input file path(s) - FASTA/FASTQ/FASTQ.gz
Options:
-o, --output-dir <DIR> Output directory [default: .]
-s, --suffix <SUFFIX> Output filename suffix [default: _counts]
-f, --format <FORMAT> Output format [default: parquet]
[possible values: parquet, csv, tsv]
-c, --chunk-size <SIZE> Chunk size for parallel processing [default: 50000]
-t, --threads <N> Number of threads (0 = auto) [default: 0]
-q, --quiet Disable progress bar
--compression <TYPE> Parquet compression [default: snappy]
[possible values: none, snappy, gzip, brotli, zstd]
--rpm Calculate RPM (Reads Per Million)
-h, --help Print help
-V, --version Print version
# Parquet (default, best for data analysis)
seqtable input.fq.gz
# CSV (spreadsheet-friendly)
seqtable input.fq.gz -f csv
# TSV (tab-separated)
seqtable input.fq.gz -f tsv
# Add RPM column for normalization
seqtable input.fq.gz --rpm -f csv
# Output includes:
# sequence,count,rpm
# ATCGATCG,1000000,50000.00
# GCTAGCTA,500000,25000.00
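RPM is the usual per-million normalization: a sequence's count divided by the total number of reads in the file, times 1,000,000. A quick check against the example output above (the 20M total is implied by the numbers shown, not printed by the tool):
# RPM = count / total_reads * 1_000_000
total_reads = 20_000_000   # implied by the example: 1,000,000 reads -> 50,000 RPM
count = 1_000_000          # ATCGATCG row
rpm = count / total_reads * 1_000_000
print(f"{rpm:.2f}")        # 50000.00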
# Custom output name and location
seqtable sample.fq.gz -o results/ -s .counts -f parquet
# Output: results/sample.counts.parquet
# Use 8 threads
seqtable input.fq.gz -t 8
# Larger chunks for big files (reduces overhead)
seqtable huge_file.fq.gz -c 100000
# Smaller chunks for memory-constrained systems
seqtable input.fq.gz -c 10000
Columnar format optimized for analytics:
# Read in Python
import polars as pl
df = pl.read_parquet("output_counts.parquet")
print(df.head())
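Continuing from the snippet above, typical follow-ups are straightforward; for example, ranking the most abundant sequences (column names as in the example output):
# Top 10 sequences by count, with their share of all counted reads
total = df["count"].sum()
top10 = (
    df.sort("count", descending=True)
      .head(10)
      .with_columns((pl.col("count") / total * 100).alias("percent"))
)
print(top10)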
Human-readable text formats:
sequence,count,rpm
ATCGATCGATCG,1500000,75000.00
GCTAGCTAGCTA,1000000,50000.00
TTAATTAATTAA,500000,25000.00
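These files can be read back with any tabular tool; a small sketch with polars (the file names are illustrative, and the TSV variant only needs a tab separator):
import polars as pl

df_csv = pl.read_csv("output_counts.csv")
df_tsv = pl.read_csv("output_counts.tsv", separator="\t")
print(df_csv.head())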
Typical performance on a 16-core system:
| File Size | Reads | Time | Memory |
|---|---|---|---|
| 1 GB | 10M | ~15s | ~500MB |
| 10 GB | 100M | ~60s | ~2GB |
| 100 GB | 1B | ~600s | ~2GB |
Key Features:
| Format | Extension | Compression | Streaming |
|---|---|---|---|
| FASTA | .fa, .fasta | ❌ | ✅ |
| FASTQ | .fq, .fastq | ❌ | ✅ |
| FASTA.gz | .fa.gz | ✅ | ✅ |
| FASTQ.gz | .fq.gz | ✅ | ✅ |
Input File(s)
↓
Streaming Reader (needletail)
↓
Chunking (50K sequences)
↓
Parallel Counting (Rayon + AHashMap)
↓
Parallel Merge
↓
Optional RPM Calculation
↓
Output (Parquet/CSV/TSV)
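In the actual tool this pipeline is built on needletail (streaming parsing), Rayon (parallel counting), and AHashMap (hashing). Purely as an illustration of the same chunk → count → merge pattern, a minimal Python sketch (not the real implementation; FASTA/FASTQ record parsing is elided):
from collections import Counter
from itertools import islice
from multiprocessing import Pool

def count_chunk(sequences):
    # Each worker builds its own hash map over a single chunk
    return Counter(sequences)

def chunked(iterable, size):
    it = iter(iterable)
    while chunk := list(islice(it, size)):
        yield chunk

def count_sequences(sequences, chunk_size=50_000, workers=4):
    # Parallel per-chunk counting followed by a merge step, mirroring the
    # Chunking -> Parallel Counting -> Parallel Merge stages above
    totals = Counter()
    with Pool(workers) as pool:
        for partial in pool.imap_unordered(count_chunk, chunked(sequences, chunk_size)):
            totals.update(partial)
    return totals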
Approximate memory usage: in-flight chunk buffers take about chunk_size × threads × ~80 bytes, and the count table about unique_sequences × ~100 bytes. With the default chunk size on a 16-core machine, that is roughly 50,000 × 16 × 80 bytes ≈ 64 MB for the chunk buffers.
# Debug build
nix develop
cargo build
# Release build with optimizations
cargo build --release
# With mold linker (faster)
mold -run cargo build --release
# Run tests
cargo test
# Generate test data
head -n 4000 input.fastq > test_small.fastq
seqtable test_small.fastq --rpm -f csv
# Time comparison
time seqtable large.fq.gz -t 1 # Single thread
time seqtable large.fq.gz -t 16 # 16 threads
# Memory profiling
/usr/bin/time -v seqtable input.fq.gz
# Reduce chunk size
seqtable input.fq.gz -c 10000
# Use fewer threads
seqtable input.fq.gz -t 4
# Increase threads
seqtable input.fq.gz -t $(nproc)
# Larger chunks (for large files)
seqtable input.fq.gz -c 100000
# Check I/O bottleneck
iostat -x 1
# Verify file format
zcat input.fq.gz | head -n 4
# Test with small sample
zcat input.fq.gz | head -n 40000 > test.fq
seqtable test.fq
Contributions are welcome! Please feel free to submit a Pull Request.
MIT License - see LICENSE file for details.
If you use this tool in your research, please cite: