| Crates.io | kira_cdh_compat_fastq_reader |
| lib.rs | kira_cdh_compat_fastq_reader |
| version | 0.1.3 |
| created_at | 2025-08-15 21:08:09.631559+00 |
| updated_at | 2025-08-16 19:45:46.301099+00 |
| description | Streaming FASTQ reader compatible with CD-HIT input handling (plain and .gz), safe idiomatic Rust API; sync and async. |
| homepage | https://github.com/ARyaskov/kira_cdh_compat_fastq_reader |
| repository | https://github.com/ARyaskov/kira_cdh_compat_fastq_reader |
| max_upload_size | |
| id | 1797534 |
| size | 88,720 |
Streaming FASTQ reader with CD-HIT–compatible input handling (plain and .gz), a safe, idiomatic Rust API, and optional async support.
mmap for faster plain-file reads.
async feature (Tokio + async-compression).
edition = 2024.
[dependencies]
kira_cdh_compat_fastq_reader = "*"
Optional features:
[dependencies]
kira_cdh_compat_fastq_reader = { version = "0.1", features = ["async", "mmap", "zlib"] }
gzip: enabled by default (gzip via flate2 with the miniz_oxide backend).
zlib: switches flate2 to the system zlib backend (closer to CD-HIT’s zlib path).
mmap: enables memmap2 for plain files (reduces syscalls).
async: enables the async API (Tokio + async-compression).
MSRV: 1.85.0 or newer (pinned).
Gzip input is detected by the .gz extension or magic bytes (1F 8B).
use kira_cdh_compat_fastq_reader::{FastqReader, ReaderOptions, ErrorPolicy, LineMode};
fn main() -> Result<(), Box<dyn std::error::Error>> {
    // CD-HIT–compatible defaults:
    let opts = ReaderOptions {
        error_policy: ErrorPolicy::Skip, // keep going on malformed records
        fastq_only: true,                // reject FASTA '>' headers
        line_mode: LineMode::Single,     // single-line seq/qual
    };
    let mut rdr = FastqReader::from_path("reads.fastq.gz", opts)?;
    for rec in &mut rdr {
        let rec = match rec {
            Ok(r) => r,
            Err(e) => { eprintln!("skipped: {e}"); continue; }
        };
        println!("id={} len={}", rec.id, rec.len());
    }
    Ok(())
}
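The gzip detection mentioned above (by .gz extension or the 1F 8B magic bytes) can be illustrated with a small standard-library sketch. This is not the crate's internal code; the helper names `has_gz_extension` and `starts_with_gzip_magic` are hypothetical and exist only for illustration.

```rust
use std::io::{self, BufRead};
use std::path::Path;

/// Gzip streams begin with the two magic bytes 0x1F 0x8B (RFC 1952).
const GZIP_MAGIC: [u8; 2] = [0x1F, 0x8B];

/// Hypothetical helper: true if the path ends in `.gz` (case-insensitive).
fn has_gz_extension(path: &Path) -> bool {
    path.extension()
        .and_then(|e| e.to_str())
        .map_or(false, |e| e.eq_ignore_ascii_case("gz"))
}

/// Hypothetical helper: peek at the buffered bytes without consuming them.
/// Assumes the first `fill_buf` call yields at least two bytes when the
/// stream has them, which holds for files and in-memory slices in practice.
fn starts_with_gzip_magic<R: BufRead>(reader: &mut R) -> io::Result<bool> {
    let buf = reader.fill_buf()?;
    Ok(buf.starts_with(&GZIP_MAGIC))
}
```

Checking the magic bytes as well as the extension lets a reader handle compressed input that arrives without a .gz name, for example over a pipe.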
From stdin:
use std::io::{self, BufReader};
use kira_cdh_compat_fastq_reader::{FastqReader, ReaderOptions, ErrorPolicy, LineMode};
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let opts = ReaderOptions {
        error_policy: ErrorPolicy::Return, // fail fast on the first malformed record
        fastq_only: true,
        line_mode: LineMode::Single,
    };
    let stdin = io::stdin();
    let rdr = BufReader::new(stdin.lock());
    let mut fq = FastqReader::from_bufread(rdr, opts);
    for rec in &mut fq {
        let r = rec?;
        println!("{}", r.id);
    }
    Ok(())
}
Enable the async feature:
kira_cdh_compat_fastq_reader = { version = "0.1", features = ["async"] }
use kira_cdh_compat_fastq_reader::{AsyncFastqReader, ReaderOptions, ErrorPolicy, LineMode};
#[tokio::main(flavor = "multi_thread")]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let opts = ReaderOptions {
        error_policy: ErrorPolicy::Skip,
        fastq_only: true,
        line_mode: LineMode::Single,
    };
    let mut rdr = AsyncFastqReader::from_path("reads.fastq.gz", opts).await?;
    while let Some(item) = rdr.next_record().await {
        let rec = match item {
            Ok(r) => r,
            Err(e) => { eprintln!("skipped: {e}"); continue; }
        };
        println!("id={} len={}", rec.id, rec.len());
    }
    Ok(())
}
You can also wrap any AsyncBufRead via:
// AsyncFastqReader::from_async_bufread(reader, opts)
Single-line (default): after the @header, sequence is exactly one line, + is one line, quality is exactly one line. This matches typical Illumina output and how CD-HIT often sees inputs.
Multi-line: sequence and/or quality may span multiple lines. Enable via:
ReaderOptions { line_mode: LineMode::Multi, ..Default::default() }
Note: Single-line mode is both stricter and faster. If your datasets are multi-line, switch to LineMode::Multi.
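To make the single-line layout concrete, here is a minimal standard-library sketch of reading one four-line record. It is an illustration under stated assumptions, not the crate's parser: `Record`, `read_line`, and `read_single_line_record` are hypothetical names, and errors are reported as plain `io::Error` values rather than the crate's error types.

```rust
use std::io::{self, BufRead};

/// Hypothetical record type for this sketch (not the crate's FastqRecord).
#[derive(Debug)]
struct Record {
    id: String,
    seq: Vec<u8>,
    qual: Vec<u8>,
}

/// Read one line and strip trailing "\n" / "\r\n". Returns None at EOF.
fn read_line<R: BufRead>(r: &mut R) -> io::Result<Option<String>> {
    let mut line = String::new();
    if r.read_line(&mut line)? == 0 {
        return Ok(None);
    }
    while line.ends_with('\n') || line.ends_with('\r') {
        line.pop();
    }
    Ok(Some(line))
}

/// Parse one single-line record: header, sequence, '+', quality.
fn read_single_line_record<R: BufRead>(r: &mut R) -> io::Result<Option<Record>> {
    let bad = |msg: &str| io::Error::new(io::ErrorKind::InvalidData, msg.to_string());
    let Some(header) = read_line(r)? else { return Ok(None) };
    let rest = header
        .strip_prefix('@')
        .ok_or_else(|| bad("header must start with '@'"))?;
    // Treat the first whitespace-delimited token as the id.
    let id = rest.split_whitespace().next().unwrap_or("").to_string();
    let seq = read_line(r)?.ok_or_else(|| bad("missing sequence line"))?;
    let plus = read_line(r)?.ok_or_else(|| bad("missing '+' line"))?;
    if !plus.starts_with('+') {
        return Err(bad("expected '+' separator line"));
    }
    let qual = read_line(r)?.ok_or_else(|| bad("missing quality line"))?;
    if seq.len() != qual.len() {
        return Err(bad("sequence/quality length mismatch"));
    }
    Ok(Some(Record { id, seq: seq.into_bytes(), qual: qual.into_bytes() }))
}
```

Because every record is exactly four lines, the parser never needs lookahead. A multi-line parser cannot assume this and has to accumulate sequence lines until it sees the + separator.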
enum ErrorPolicy {
    Skip,   // default: skip malformed records and continue (CD-HIT-like)
    Return, // fail fast on first malformed record
}
Typical format errors include:
A header line that does not start with @ (or a FASTA > header encountered in FASTQ-only mode).
A missing + line.
All errors carry an I/O context (byte offset and line number).
With ErrorPolicy::Skip, the parser attempts to resynchronize at the next line starting with @. This mirrors the robust “keep going” behavior often expected in CD-HIT pipelines when inputs contain occasional malformed records.
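The resynchronization idea can be sketched over in-memory lines as follows. This is a simplified illustration, not the crate's implementation: the `Policy` enum and `parse_ids` function are hypothetical, and the sketch assumes single-line records. Note that a quality line may itself begin with @, so resyncing on the next @ line is a heuristic.

```rust
/// Hypothetical policy enum mirroring the variant names used above.
#[derive(Clone, Copy)]
enum Policy {
    Skip,
    Return,
}

/// Parse single-line FASTQ text and collect the ids of well-formed records.
/// Under `Policy::Return`, yields Err(line_number) (1-based) for the first
/// malformed record; under `Policy::Skip`, resyncs at the next '@' line.
fn parse_ids(text: &str, policy: Policy) -> Result<Vec<String>, usize> {
    let lines: Vec<&str> = text.lines().collect();
    let mut ids = Vec::new();
    let mut i = 0;
    while i < lines.len() {
        // A record is well formed if all four lines exist and line up.
        let ok = i + 3 < lines.len()
            && lines[i].starts_with('@')
            && lines[i + 2].starts_with('+')
            && lines[i + 1].len() == lines[i + 3].len();
        if ok {
            ids.push(lines[i][1..].to_string());
            i += 4;
            continue;
        }
        match policy {
            Policy::Return => return Err(i + 1),
            Policy::Skip => {
                // Step past the offending line, then scan for the next '@'.
                i += 1;
                while i < lines.len() && !lines[i].starts_with('@') {
                    i += 1;
                }
            }
        }
    }
    Ok(ids)
}
```

For input where the second record lacks its + line, `Policy::Skip` returns the ids of the first and third records, while `Policy::Return` reports the line number of the broken record's header.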
Types
FastqReader: synchronous streaming reader (plain or .gz).
AsyncFastqReader: asynchronous streaming reader (feature async).
FastqRecord: { id, desc: Option<String>, seq: Vec<u8>, qual: Vec<u8> }.
ReaderOptions: { error_policy, fastq_only, line_mode }.
ErrorPolicy: Skip or Return.
LineMode: Single or Multi.
FastqError / FormatError: detailed error types with context.
Construction
// sync
let mut r = FastqReader::from_path("reads.fastq.gz", opts)?;
// or
let mut r = FastqReader::from_bufread(my_buf_reader, opts);
// async
let mut ar = AsyncFastqReader::from_path("reads.fastq.gz", opts).await?;
// or
let mut ar = AsyncFastqReader::from_async_bufread(my_async_bufread, opts);
Iteration
// sync
for item in &mut r {
    let rec = item?; // or handle Skip policy
    // ...
}
// async
while let Some(item) = ar.next_record().await {
    let rec = item?; // or handle Skip policy
    // ...
}
Plain FASTQ + mmap (--features mmap): can reduce syscalls and improve throughput on fast storage (commonly +5–30% vs buffered reads).
Gzip:
The flate2 backend (miniz_oxide) provides solid performance.
--features zlib switches to system zlib for closer parity with CD-HIT’s zlib path.
I/O-bound workloads benefit most from larger buffers and sequential access patterns; in CPU-bound cases (e.g., heavy downstream processing), the downstream work usually dwarfs parse costs.
Use cargo bench to evaluate on your hardware and datasets.
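As a rough illustration of the buffer-size point, the sketch below counts single-line FASTQ records through a `BufReader` whose capacity is a parameter. It uses only the standard library; `count_records` is a hypothetical helper, not part of this crate, and it assumes well-formed four-line records.

```rust
use std::io::{self, BufRead, BufReader, Read};

/// Count FASTQ records (four lines each, single-line layout) while reading
/// through a buffer of `cap` bytes. Larger capacities mean fewer read calls
/// on the underlying source.
fn count_records<R: Read>(src: R, cap: usize) -> io::Result<u64> {
    let mut reader = BufReader::with_capacity(cap, src);
    let mut line = Vec::new(); // reused across iterations to avoid reallocating
    let mut lines = 0u64;
    loop {
        line.clear();
        if reader.read_until(b'\n', &mut line)? == 0 {
            break; // EOF
        }
        lines += 1;
    }
    Ok(lines / 4)
}
```

A reader built this way could be handed to `FastqReader::from_bufread` in place of a default-sized `BufReader`; whether that helps depends on the storage and should be measured.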
# default features (gzip enabled)
cargo test
# all features
cargo test --all-features
# benches
cargo bench
Tests cover:
async feature).
Licensed under GPLv2, like CD-HIT.