kira-cdh

Crates.io: kira-cdh
lib.rs: kira-cdh
Version: 0.1.1
Created: 2025-08-17 22:59:34.49659+00
Updated: 2025-08-18 02:25:54.360725+00
Description: Single-binary, CLI-compatible replacement for CD-HIT utilities (cd-hit, cd-hit-est, cd-hit-2d, cd-hit-est-2d) in Rust.
Repository: https://github.com/ARyaskov/kira-cdh
Documentation: https://docs.rs/kira-cdh
Author: Andrei Riaskóv (ARyaskov)


Kira CDH

kira-cdh is a single-binary, CLI-compatible replacement for the core CD-HIT utilities:

  • --mode cd-hit — protein clustering
  • --mode cd-hit-est — nucleotide clustering
  • --mode cd-hit-2d — two-dataset protein comparison
  • --mode cd-hit-est-2d — two-dataset nucleotide comparison

It accepts the same flags as the original tools, although in v0.1 some of them are parsed but not yet functional (see Feature parity). The pipeline is implemented in Rust (edition 2024) using a modular stack:

FASTX I/O → k-mer hashing → KMV/MinHash signatures → LSH candidate retrieval → greedy clustering → CD-HIT compatible .clstr writer.
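
To make the hashing and signature stages of this pipeline concrete, here is a minimal sketch, not the crate's actual code. It swaps in the classic multi-seed MinHash variant rather than KMV because it is the shortest to write down. The names hash_kmer, minhash_signature and jaccard_estimate are hypothetical, and the use of the standard library's DefaultHasher is an assumption made only for illustration.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hash one k-mer together with a seed (hypothetical helper, std hasher assumed).
fn hash_kmer(kmer: &[u8], seed: u64) -> u64 {
    let mut h = DefaultHasher::new();
    seed.hash(&mut h);
    kmer.hash(&mut h);
    h.finish()
}

/// Build a MinHash signature: slot `i` holds the minimum hash of all k-mers
/// under seed `i`. Returns None when the sequence is shorter than `k`.
fn minhash_signature(seq: &[u8], k: usize, sig_len: usize) -> Option<Vec<u64>> {
    if k == 0 || seq.len() < k {
        return None;
    }
    let mut sig = vec![u64::MAX; sig_len];
    for kmer in seq.windows(k) {
        for (seed, slot) in sig.iter_mut().enumerate() {
            let h = hash_kmer(kmer, seed as u64);
            if h < *slot {
                *slot = h;
            }
        }
    }
    Some(sig)
}

/// Estimate Jaccard similarity as the fraction of slots on which two
/// equal-length signatures agree.
fn jaccard_estimate(a: &[u64], b: &[u64]) -> f64 {
    assert_eq!(a.len(), b.len());
    if a.is_empty() {
        return 0.0;
    }
    let same = a.iter().zip(b).filter(|(x, y)| x == y).count();
    same as f64 / a.len() as f64
}
```

With k = 5 and sig_len = 128 this mirrors the protein defaults listed in this README. A production implementation would use a faster hash and avoid rehashing each k-mer once per seed.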

Status: v0.1 focuses on the 4 core modes above. Additional CD-HIT variants (e.g. PSI-CD-HIT, 454/OTU/LAP/DUP) can be added later via the same --mode mechanism.


Highlights

  • Single binary with --mode switch, drop-in CLI flag compatibility.
  • Fast I/O (FASTA/FASTQ, transparent .gz) with robust, CD-HIT-like error handling.
  • Scalable indexing via KMV/MinHash signatures and LSH for candidate discovery.
  • Greedy clustering (length-first, CD-HIT-like) with optional coverage gates.
  • CD-HIT .clstr output compatibility.

Installation

From source

# Requires Rust stable (MSRV = 1.85)
git clone https://github.com/ARyaskov/kira-cdh
cd kira-cdh
cargo install --path .

Quick start

Protein clustering (cd-hit):

kira-cdh --mode cd-hit \
  -i proteins.fasta -o clusters -c 0.9 -n 5 -T 16

Nucleotide clustering (cd-hit-est):

kira-cdh --mode cd-hit-est \
  -i reads.fasta.gz -o clusters -c 0.97 -n 10 -T 16

Two-dataset comparison (protein):

kira-cdh --mode cd-hit-2d \
  -i proteinsA.fasta \
  --i2 proteinsB.fasta \
  -o B_vs_A -c 0.9 -n 5 -T 16

Two-dataset comparison (nucleotide):

kira-cdh --mode cd-hit-est-2d \
  -i readsA.fasta.gz \
  --i2 readsB.fasta.gz \
  -o B_vs_A -c 0.97 -n 10 -T 16

Outputs:

  • <prefix> — FASTA with cluster representatives
  • <prefix>.clstr — CD-HIT-compatible cluster file

CLI compatibility

The tool exposes the same flags as the original utilities for the selected mode. To list them, run:

kira-cdh --mode <cd-hit|cd-hit-est|cd-hit-2d|cd-hit-est-2d> --help

Common flags (subset)

  • -i <file> — input (FASTA/FASTQ; .gz supported)
  • -o <prefix> — output prefix
  • -c <float> — identity threshold [0..1] (used for MinHash Jaccard gate)
  • -n <int> — word length (k-mer size). Defaults: protein=5, nucleotide=10
  • -T <int> — threads (0 = all CPUs)
  • -M <int> — memory limit (advisory)
  • -d <int> — description length in output FASTA (0 = full)
  • Coverage/length controls: -aS, -aL, -A, -s, -S, -uS, -uL, -U
  • Nucleotide scoring knobs (kept for compatibility): --match, --mismatch, --gap, --gap-ext
  • Sorting/format knobs: --sf, --sc, --bak, -p

2D modes

  • --i2 <file> — second input (required)
  • Optional asymmetric cutoffs: --s2, --S2 (length-diff gates for db1)

Paired-end (cd-hit-est only)

  • -P 1 -j <R2.fastq> --op <out_R2> — paired-end passthrough hooks

Note: In v0.1, only the flags that affect the LSH/Jaccard/greedy stages are functionally active. Other flags are parsed and validated but may be no-ops at this stage; see Feature parity.


Mapping to CD-HIT semantics

Internally, identity gating uses MinHash/KMV signatures and LSH:

  • Signatures: KMV, length = 128 by default.
  • LSH: bands = 32, rows = 4 (compatible with signature length).
  • Candidate retrieval: keep pairs with at least ceil(c * rows) collisions, where c is -c.
  • Final acceptance in 2D mode requires jaccard_from_signatures() ≥ c.
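
The banding scheme above can be sketched as follows. This is an illustration under stated assumptions, not the crate's implementation: LshIndex, band_key and min_collisions are hypothetical names, a "collision" is assumed to mean that two signatures share a bucket within one band, and the standard library hasher stands in for whatever hash the tool uses.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

const BANDS: usize = 32;
const ROWS: usize = 4;

/// Hash the ROWS values of one band into a bucket key.
fn band_key(band: &[u64]) -> u64 {
    let mut h = DefaultHasher::new();
    band.hash(&mut h);
    h.finish()
}

/// Collision threshold as stated in this README: ceil(c * rows).
fn min_collisions(c: f64) -> usize {
    (c * ROWS as f64).ceil() as usize
}

/// One bucket table per band, mapping bucket key -> ids of indexed sequences.
struct LshIndex {
    tables: Vec<HashMap<u64, Vec<usize>>>,
}

impl LshIndex {
    fn new() -> Self {
        LshIndex { tables: vec![HashMap::new(); BANDS] }
    }

    /// Insert a signature of length BANDS * ROWS under the given id.
    fn insert(&mut self, id: usize, sig: &[u64]) {
        assert_eq!(sig.len(), BANDS * ROWS);
        for (table, band) in self.tables.iter_mut().zip(sig.chunks(ROWS)) {
            table.entry(band_key(band)).or_default().push(id);
        }
    }

    /// Count band collisions per indexed id and keep the ids that reach
    /// `min_collisions`, returned in ascending order.
    fn candidates(&self, sig: &[u64], min_collisions: usize) -> Vec<usize> {
        let mut hits: HashMap<usize, usize> = HashMap::new();
        for (table, band) in self.tables.iter().zip(sig.chunks(ROWS)) {
            if let Some(ids) = table.get(&band_key(band)) {
                for &id in ids {
                    *hits.entry(id).or_insert(0) += 1;
                }
            }
        }
        let mut out: Vec<usize> = hits
            .into_iter()
            .filter(|&(_, n)| n >= min_collisions)
            .map(|(id, _)| id)
            .collect();
        out.sort_unstable();
        out
    }
}
```

Candidates returned this way are only a shortlist. Each one would still have to pass the signature-based Jaccard check against -c before being accepted.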

Greedy clustering is CD-HIT-like:

  • Representatives are chosen in length-descending order (--sc/--sf affect output sorting only).
  • For 1-set modes, clustering runs over the entire set.
  • For 2D modes, set A is indexed; each sequence from B is assigned to the best matching A if Jaccard ≥ -c. Otherwise the B sequence forms a singleton cluster.
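
For the 1-set case, the length-first greedy pass can be sketched as below. This is a simplified illustration: the Seq struct and greedy_cluster function are hypothetical, signatures are assumed to be precomputed, representatives are scanned by brute force instead of through an LSH index, and a sequence joins the first representative that meets the threshold.

```rust
/// Minimal record: an id, a sequence length, and a precomputed signature.
struct Seq {
    id: usize,
    len: usize,
    sig: Vec<u64>,
}

/// Fraction of slots on which two equal-length signatures agree.
fn jaccard_estimate(a: &[u64], b: &[u64]) -> f64 {
    assert_eq!(a.len(), b.len());
    if a.is_empty() {
        return 0.0;
    }
    let same = a.iter().zip(b).filter(|(x, y)| x == y).count();
    same as f64 / a.len() as f64
}

/// Greedy clustering. Returns clusters of ids; the first id in each cluster
/// is its representative.
fn greedy_cluster(mut seqs: Vec<Seq>, c: f64) -> Vec<Vec<usize>> {
    // Longest first; ties are broken by id so the result is deterministic.
    seqs.sort_by(|a, b| b.len.cmp(&a.len).then(a.id.cmp(&b.id)));

    let mut reps: Vec<usize> = Vec::new(); // positions of representatives in `seqs`
    let mut clusters: Vec<Vec<usize>> = Vec::new();

    for (i, s) in seqs.iter().enumerate() {
        // Brute-force scan; an LSH index would narrow this to a few candidates.
        let hit = reps
            .iter()
            .position(|&r| jaccard_estimate(&seqs[r].sig, &s.sig) >= c);
        match hit {
            Some(ci) => clusters[ci].push(s.id),
            None => {
                // No representative is similar enough: start a new cluster.
                reps.push(i);
                clusters.push(vec![s.id]);
            }
        }
    }
    clusters
}
```

Because sequences are visited longest first, a representative is never shorter than the members assigned to it, which is the property the length-first ordering is meant to guarantee.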

.clstr output:

  • Written via a CD-HIT-compatible writer, with optional length annotations (nt/aa).
  • The first member in a cluster is the representative and ends with *.
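
As a rough illustration of the layout described above, the sketch below formats one cluster in the usual CD-HIT .clstr style. The Member struct and format_cluster function are hypothetical and not taken from the crate. The strand marker that cd-hit-est normally adds to nucleotide entries is left out for brevity.

```rust
use std::fmt::Write;

/// One cluster member. `identity` is None for the representative and
/// Some(percent) for every other member.
struct Member<'a> {
    name: &'a str,
    len: usize,
    identity: Option<f64>,
}

/// Format a single cluster. `unit` is the length annotation, "aa" or "nt".
fn format_cluster(index: usize, unit: &str, members: &[Member<'_>]) -> String {
    let mut out = String::new();
    writeln!(out, ">Cluster {}", index).unwrap();
    for (i, m) in members.iter().enumerate() {
        match m.identity {
            // The representative line ends with "*".
            None => writeln!(out, "{}\t{}{}, >{}... *", i, m.len, unit, m.name).unwrap(),
            // Other members report their identity to the representative.
            Some(p) => {
                writeln!(out, "{}\t{}{}, >{}... at {:.2}%", i, m.len, unit, m.name, p).unwrap()
            }
        }
    }
    out
}
```

Downstream tools that parse .clstr files typically rely on the ">Cluster N" header, the tab after the member index, and the trailing "*", so those are the parts a compatible writer has to reproduce exactly.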

Input formats

  • FASTA, FASTQ (transparently supports .gz)
  • Multi-line FASTA/FASTQ supported
  • Robust error handling (skip malformed records, attempt resynchronization)
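
A tolerant multi-line FASTA reader along these lines might look like the sketch below. It is an assumption-laden illustration rather than the crate's reader: read_fasta is a hypothetical name, "malformed" is taken to mean text before the first header or a header with no sequence, and FASTQ and .gz handling are omitted.

```rust
use std::io::{BufRead, BufReader, Read};

/// Read FASTA records as (header, sequence) pairs, joining wrapped sequence
/// lines. Text before the first header and records without sequence data
/// are skipped instead of aborting the run.
fn read_fasta<R: Read>(input: R) -> Vec<(String, Vec<u8>)> {
    let mut records = Vec::new();
    let mut current: Option<(String, Vec<u8>)> = None;

    for line in BufReader::new(input).lines() {
        // Stop at the first I/O or encoding error and keep what was read.
        let Ok(line) = line else { break };
        let line = line.trim_end();

        if let Some(header) = line.strip_prefix('>') {
            // A new header closes the previous record, if it has any sequence.
            if let Some(rec) = current.take() {
                if !rec.1.is_empty() {
                    records.push(rec);
                }
            }
            current = Some((header.to_string(), Vec::new()));
        } else if let Some((_, seq)) = current.as_mut() {
            // Sequence line: append it, dropping any embedded whitespace.
            seq.extend(line.bytes().filter(|b| !b.is_ascii_whitespace()));
        }
        // Lines seen before any header fall through and are ignored.
    }

    if let Some(rec) = current {
        if !rec.1.is_empty() {
            records.push(rec);
        }
    }
    records
}
```

Collecting everything into a Vec keeps the example short. A streaming reader would hand each record to the signature stage as soon as it is complete.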

Feature parity (v0.1)

Implemented end-to-end:

  • -i, -o, --i2 (2D), -c, -n, -T, -d
  • CD-HIT-like greedy clustering (length-first)
  • LSH candidate retrieval + MinHash/KMV Jaccard gate
  • .clstr writer compatibility
  • -aS / -aL coverage gates (basic support). When either is set > 0, the clusterer is configured with the corresponding coverage thresholds.

Parsed & validated (currently no-op or partial; accepted for CLI parity):

  • -M, -G, -b, -t, -s, -S, -A, -uS, -uL, -U
  • -p, --sf, --sc, --bak
  • Nucleotide scoring (--match, --mismatch, --gap, --gap-ext)
  • Paired-end hooks (-P, -j, --op, --cx, --cy, --ap, -r) — parsed; not all affect clustering yet

If you depend on the exact upstream semantics of a flag that is not listed under “Implemented end-to-end”, please open an issue. The plan is to add strict fail-fast checks for unsupported semantics in a subsequent minor release.


Performance knobs

  • Threads: -T (default: 1; 0 = all CPUs).
  • k-mer size: -n (defaults: protein=5, nucleotide=10).
  • Identity threshold: -c influences LSH and final Jaccard acceptance.
  • Signature length: currently fixed to 128 (32×4); future releases may expose this.
  • Memory: -M is advisory in v0.1 (no strict cgroup/pid limit). Indexing is streaming; memory depends mainly on signature storage and LSH buckets.
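
For a back-of-the-envelope feel for the memory note above, the sketch below estimates the index footprint. The byte sizes are assumptions, not measurements of kira-cdh: 8 bytes per signature value, and one bucket entry of an 8-byte key plus a 4-byte sequence id per band. The function name estimate_bytes is hypothetical.

```rust
/// Rough index size in bytes.
/// Assumes 64-bit signature values and, for each band, one bucket entry
/// made of a u64 key and a u32 sequence id. Allocator and hash-table
/// overhead are ignored.
fn estimate_bytes(n_seqs: u64, sig_len: u64, bands: u64) -> u64 {
    let signatures = n_seqs * sig_len * 8;
    let buckets = n_seqs * bands * (8 + 4);
    signatures + buckets
}
```

Under these assumptions, estimate_bytes(1_000_000, 128, 32) comes to 1,408,000,000 bytes, or roughly 1.4 GB for one million sequences, with the sequences themselves not counted.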

Examples

Cluster proteins at 90% identity:

kira-cdh --mode cd-hit \
  -i uniprot_sprot.fasta \
  -o sprot90 \
  -c 0.90 -n 5 -T 32

Cluster reads at 97% identity (nucleotide):

kira-cdh --mode cd-hit-est \
  -i reads.fasta.gz \
  -o reads97 \
  -c 0.97 -n 10 -T 16

Compare B against A (protein 2D):

kira-cdh --mode cd-hit-2d \
  -i A.faa --i2 B.faa \
  -o B_vs_A \
  -c 0.9 -n 5

The resulting B_vs_A.clstr contains A-anchored clusters for matches and singleton clusters for unmatched B sequences.


Logging

Set RUST_LOG to tune verbosity:

RUST_LOG=info kira-cdh --mode cd-hit -i input.fa -o out -c 0.9
RUST_LOG=debug kira-cdh --mode cd-hit-est -i input.fq.gz -o out -c 0.97

Contributing

  • Follow KISS/DRY principles; prefer small, well-documented modules.
  • For performance work, include before/after benchmarks and dataset notes.
  • When wiring a new CD-HIT flag, update README.md (Feature parity) and add a validation path.

License

GPLv2.


Acknowledgements

We would like to thank the original authors and maintainers of CD-HIT for their contributions to the field of sequence clustering, which served as an inspiration for this project.
