seqtable

Crates.io: seqtable
lib.rs: seqtable
version: 0.1.1
created_at: 2025-10-30 02:38:38.245422+00
updated_at: 2025-10-30 10:46:07.666764+00
description: High-performance parallel FASTA/FASTQ sequence counter
homepage: https://github.com/mulatta/seqtable
repository: https://github.com/mulatta/seqtable
max_upload_size:
id: 1907629
size: 85,489
owner: mulatta (mulatta)
documentation: https://docs.rs/seqtable

README

seqtable

🧬 High-performance parallel FASTA/FASTQ sequence counter with multiple output formats

License: MIT

Features

  • ⚡ Fast: Parallel processing with Rayon (5-10x speedup on multi-core systems)
  • 💾 Memory Efficient: Streaming I/O with constant memory usage
  • 📊 Multiple Formats: Parquet, CSV, TSV output
  • 📈 RPM Calculation: Optional Reads Per Million normalization
  • 🗜️ Compression: Native support for .gz files
  • 🎯 Simple: Single binary with no runtime dependencies

Installation

Using Nix (Recommended)

# Install from this repository
nix profile install github:mulatta/seqtable

# Or run directly
nix run github:mulatta/seqtable -- --help

From Source

git clone https://github.com/mulatta/seqtable
cd seqtable
cargo build --release
./target/release/seqtable --help

Quick Start

Basic Usage

# Count sequences in a FASTQ file
seqtable input.fastq.gz

# Specify output directory
seqtable input.fastq.gz -o results/

# Use CSV format with RPM
seqtable input.fastq.gz -f csv --rpm

Multiple Files

Use GNU parallel for processing multiple files:

# Process all FASTQ files in parallel (4 jobs)
parallel -j 4 seqtable {} -o results/ ::: *.fastq.gz

# Memory-aware processing
parallel --memfree 4G seqtable {} -o results/ ::: *.fq.gz

Usage

seqtable [OPTIONS] <INPUT>...

Arguments:
  <INPUT>...  Input file path(s) - FASTA/FASTQ/FASTQ.gz

Options:
  -o, --output-dir <DIR>        Output directory [default: .]
  -s, --suffix <SUFFIX>         Output filename suffix [default: _counts]
  -f, --format <FORMAT>         Output format [default: parquet]
                                [possible values: parquet, csv, tsv]
  -c, --chunk-size <SIZE>       Chunk size for parallel processing [default: 50000]
  -t, --threads <N>             Number of threads (0 = auto) [default: 0]
  -q, --quiet                   Disable progress bar
  --compression <TYPE>          Parquet compression [default: snappy]
                                [possible values: none, snappy, gzip, brotli, zstd]
  --rpm                         Calculate RPM (Reads Per Million)
  -h, --help                    Print help
  -V, --version                 Print version

Examples

Output Formats

# Parquet (default, best for data analysis)
seqtable input.fq.gz

# CSV (spreadsheet-friendly)
seqtable input.fq.gz -f csv

# TSV (tab-separated)
seqtable input.fq.gz -f tsv

With RPM Calculation

# Add RPM column for normalization
seqtable input.fq.gz --rpm -f csv

# Output includes:
# sequence,count,rpm
# ATCGATCG,1000000,50000.00
# GCTAGCTA,500000,25000.00
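
The RPM column is plain counts-per-million scaling: rpm = count / total_reads × 1,000,000. The sketch below reproduces the sample numbers above; it is an illustration of the formula only, and assumes a total of 20M reads (which is what those figures imply), not the crate's internals.

# Illustrative RPM calculation (values match the sample output above)
counts = {"ATCGATCG": 1_000_000, "GCTAGCTA": 500_000}
total_reads = 20_000_000  # assumed total for this example

for seq, count in counts.items():
    rpm = count / total_reads * 1_000_000
    print(f"{seq},{count},{rpm:.2f}")
# ATCGATCG,1000000,50000.00
# GCTAGCTA,500000,25000.00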

Custom Output

# Custom output name and location
seqtable sample.fq.gz -o results/ -s .counts -f parquet

# Output: results/sample.counts.parquet

Performance Tuning

# Use 8 threads
seqtable input.fq.gz -t 8

# Larger chunks for big files (reduces overhead)
seqtable huge_file.fq.gz -c 100000

# Smaller chunks for memory-constrained systems
seqtable input.fq.gz -c 10000

Output Format

Parquet (default)

Columnar format optimized for analytics:

  • Efficient compression
  • Fast queries with tools like DuckDB and Polars (examples below)
  • Schema preservation

# Read in Python
import polars as pl
df = pl.read_parquet("output_counts.parquet")
print(df.head())
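
DuckDB can also query the Parquet file in place without loading it fully. A minimal sketch (the filename assumes the default _counts suffix; "count" is quoted because it is a SQL keyword):

# Query the counts table in place with DuckDB
import duckdb

top = duckdb.sql("""
    SELECT sequence, "count"
    FROM 'output_counts.parquet'
    ORDER BY "count" DESC
    LIMIT 10
""").df()
print(top)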

CSV/TSV

Human-readable text formats:

sequence,count,rpm
ATCGATCGATCG,1500000,75000.00
GCTAGCTAGCTA,1000000,50000.00
TTAATTAATTAA,500000,25000.00
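
Both text formats read back cleanly into a dataframe. A sketch with Polars (filenames assume the default suffix):

import polars as pl

# CSV: comma-separated (default)
df_csv = pl.read_csv("output_counts.csv")

# TSV: same reader with a tab separator
df_tsv = pl.read_csv("output_counts.tsv", separator="\t")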

Performance

Typical performance on a 16-core system:

File Size   Reads   Time    Memory
1 GB        10M     ~15s    ~500MB
10 GB       100M    ~60s    ~2GB
100 GB      1B      ~600s   ~2GB

Key Features:

  • Linear scaling with CPU cores
  • Constant memory usage regardless of file size
  • Efficient handling of gzip-compressed files

File Format Support

Format     Extension     Compression  Streaming
FASTA      .fa, .fasta   —            ✓
FASTQ      .fq, .fastq   —            ✓
FASTA.gz   .fa.gz        gzip         ✓
FASTQ.gz   .fq.gz        gzip         ✓

Architecture

Processing Pipeline

Input File(s)
    ↓
Streaming Reader (needletail)
    ↓
Chunking (50K sequences)
    ↓
Parallel Counting (Rayon + AHashMap)
    ↓
Parallel Merge
    ↓
Optional RPM Calculation
    ↓
Output (Parquet/CSV/TSV)
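
The shape of this pipeline (chunk, count per worker, merge) translates to a few lines in any language. The sketch below mimics it in Python for illustration; it is not the crate's Rayon-based Rust internals. Run it under a __main__ guard on platforms that spawn worker processes.

from collections import Counter
from concurrent.futures import ProcessPoolExecutor
from itertools import islice

def count_chunk(seqs):
    # Each worker counts one chunk independently
    return Counter(seqs)

def count_sequences(seq_iter, chunk_size=50_000, workers=4):
    def chunks(source):
        it = iter(source)
        while chunk := list(islice(it, chunk_size)):
            yield chunk
    total = Counter()
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # Merge partial counts as workers finish their chunks
        for partial in pool.map(count_chunk, chunks(seq_iter)):
            total.update(partial)
    return total

# Example: count_sequences(["ACGT", "ACGT", "TTTT"], chunk_size=2, workers=2)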

Memory Usage

  • Base: ~100MB (program overhead)
  • Chunks: chunk_size × threads × ~80 bytes
  • HashMap: unique_sequences × ~100 bytes
  • Total: Typically 1-3GB for large files (see the sketch below)
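
A back-of-envelope estimate applying the constants above (the per-entry byte sizes are the rough figures from this list, not measured values):

def estimate_memory_bytes(chunk_size, threads, unique_sequences):
    base = 100 * 1024**2                 # ~100 MB program overhead
    chunks = chunk_size * threads * 80   # in-flight chunk buffers
    hashmap = unique_sequences * 100     # count-table entries
    return base + chunks + hashmap

# Default chunk size (-c 50000), 16 threads, 10M unique sequences:
mb = estimate_memory_bytes(50_000, 16, 10_000_000) / 1024**2
print(f"~{mb:.0f} MB")  # roughly 1.1 GB, within the 1-3 GB range above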

Key Optimizations

  1. Streaming I/O: Files processed incrementally
  2. Parallel Hashing: Multi-threaded counting with AHash
  3. Zero-Copy: Minimal data duplication
  4. Adaptive Chunking: Optimal chunk size selection

Development

Building

# Debug build
nix develop
cargo build

# Release build with optimizations
cargo build --release

# With mold linker (faster)
mold -run cargo build --release

Testing

# Run tests
cargo test

# Generate test data
head -n 4000 input.fastq > test_small.fastq
seqtable test_small.fastq --rpm -f csv

Benchmarking

# Time comparison
time seqtable large.fq.gz -t 1    # Single thread
time seqtable large.fq.gz -t 16   # 16 threads

# Memory profiling
/usr/bin/time -v seqtable input.fq.gz

Troubleshooting

Out of Memory

# Reduce chunk size
seqtable input.fq.gz -c 10000

# Use fewer threads
seqtable input.fq.gz -t 4

Slow Performance

# Increase threads
seqtable input.fq.gz -t $(nproc)

# Larger chunks (for large files)
seqtable input.fq.gz -c 100000

# Check I/O bottleneck
iostat -x 1

File Format Issues

# Verify file format (decompress first, then inspect)
zcat input.fq.gz | head -n 4

# Test with a small sample
zcat input.fq.gz | head -n 40000 > test.fq
seqtable test.fq

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

MIT License - see LICENSE file for details.

Citation

If you use this tool in your research, please cite:

Acknowledgments

See Also

  • seqkit - FASTA/FASTQ toolkit
  • fastp - Fast preprocessing
  • bbmap - Comprehensive toolkit