kira-cdh

Crates.io	kira-cdh
lib.rs	kira-cdh
version	0.1.1
created_at	2025-08-17 22:59:34.49659+00
updated_at	2025-08-18 02:25:54.360725+00
description	Single-binary, CLI-compatible replacement for CD-HIT utilities (cd-hit, cd-hit-est, cd-hit-2d, cd-hit-est-2d) in Rust.
repository	https://github.com/ARyaskov/kira-cdh
id	1799761
size	154,000

Andrei Riaskóv (ARyaskov)

documentation

https://docs.rs/kira-cdh

README

Kira CDH

kira-cdh is a single-binary, CLI-compatible replacement for the core CD-HIT utilities:

--mode cd-hit — protein clustering
--mode cd-hit-est — nucleotide clustering
--mode cd-hit-2d — two-dataset protein comparison
--mode cd-hit-est-2d — two-dataset nucleotide comparison

It accepts the same flags as the original tools. The pipeline is implemented in Rust (edition 2024) using a modular stack:

FASTX I/O → k-mer hashing → KMV/MinHash signatures → LSH candidate retrieval → greedy clustering → CD-HIT compatible .clstr writer.
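The k-mer hashing and signature stages can be illustrated with a minimal sketch. This is not the crate's code: the FNV-1a hash, the names fnv1a64 and kmv_signature, and the choice to hash raw k-mer bytes (no canonical strand, no reduced alphabet) are assumptions made for illustration. It shows how a sequence is cut into overlapping k-mers and reduced to the sig_len smallest distinct hash values, which is what a KMV (bottom-k) signature is.

```rust
use std::collections::BTreeSet;

/// FNV-1a 64-bit hash. A stand-in only: this README does not say which
/// hash function kira-cdh actually uses.
fn fnv1a64(bytes: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf2_9ce4_8422_2325; // FNV offset basis
    for &b in bytes {
        h ^= u64::from(b);
        h = h.wrapping_mul(0x0000_0100_0000_01b3); // FNV prime
    }
    h
}

/// Hypothetical helper: returns the `sig_len` smallest distinct k-mer
/// hashes of `seq`, in ascending order (a bottom-k / KMV signature).
fn kmv_signature(seq: &[u8], k: usize, sig_len: usize) -> Vec<u64> {
    if k == 0 || seq.len() < k {
        return Vec::new(); // no complete k-mer fits
    }
    let mut smallest = BTreeSet::new();
    for kmer in seq.windows(k) {
        smallest.insert(fnv1a64(kmer));
        if smallest.len() > sig_len {
            smallest.pop_last(); // drop the largest hash to stay at sig_len
        }
    }
    smallest.into_iter().collect()
}
```

For example, with -n 5 a 300-residue protein yields 296 overlapping k-mers, of which at most 128 hash values would be kept at the default signature length.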

Status: v0.1 focuses on the 4 core modes above. Additional CD-HIT variants (e.g. PSI-CD-HIT, 454/OTU/LAP/DUP) can be added later via the same --mode mechanism.

Highlights

Single binary with --mode switch, drop-in CLI flag compatibility.
Fast I/O (FASTA/FASTQ, transparent .gz) with robust, CD-HIT-like error handling.
Scalable indexing via KMV/MinHash signatures and LSH for candidate discovery.
Greedy clustering (length-first, CD-HIT-like) with optional coverage gates.
CD-HIT .clstr output compatibility.

Installation

From source

# Requires Rust stable (MSRV = 1.85)
git clone https://github.com/ARyaskov/kira-cdh
cd kira-cdh
cargo install --path .

Quick start

Protein clustering (cd-hit):

kira-cdh --mode cd-hit \
  -i proteins.fasta -o clusters -c 0.9 -n 5 -T 16

Nucleotide clustering (cd-hit-est):

kira-cdh --mode cd-hit-est \
  -i reads.fasta.gz -o clusters -c 0.97 -n 10 -T 16

Two-dataset comparison (protein):

kira-cdh --mode cd-hit-2d \
  -i proteinsA.fasta \
  --i2 proteinsB.fasta \
  -o B_vs_A -c 0.9 -n 5 -T 16

Two-dataset comparison (nucleotide):

kira-cdh --mode cd-hit-est-2d \
  -i readsA.fasta.gz \
  --i2 readsB.fasta.gz \
  -o B_vs_A -c 0.97 -n 10 -T 16

Outputs:

<prefix> — FASTA with cluster representatives
<prefix>.clstr — CD-HIT-compatible cluster file

CLI compatibility

The tool exposes the same flags as the original utilities for the selected mode. To list them, run:

kira-cdh --mode <cd-hit|cd-hit-est|cd-hit-2d|cd-hit-est-2d> --help

Common flags (subset)

-i <file> — input (FASTA/FASTQ; .gz supported)
-o <prefix> — output prefix
-c <float> — identity threshold [0..1] (used for MinHash Jaccard gate)
-n <int> — word length (k-mer size). Defaults: protein=5, nucleotide=10
-T <int> — threads (0 = all CPUs)
-M <int> — memory limit (advisory)
-d <int> — description length in output FASTA (0 = full)
Coverage/length controls: -aS, -aL, -A, -s, -S, -uS, -uL, -U
Nucleotide scoring knobs (kept for compatibility): --match, --mismatch, --gap, --gap-ext
Sorting/format knobs: --sf, --sc, --bak, -p

2D modes

--i2 <file> — second input (required)
Optional asymmetric cutoffs: --s2, --S2 (length-diff gates for db1)

Paired-end (cd-hit-est only)

-P 1 -j <R2.fastq> --op <out_R2> — paired-end passthrough hooks

Note: For v0.1, only the flags that affect the LSH/Jaccard/greedy stages are functionally active (see details below). Other flags are parsed and validated, but may be no-ops at this stage; see Feature parity.

Mapping to CD-HIT semantics

Internally, identity gating uses MinHash/KMV signatures and LSH:

Signatures: KMV, length = 128 by default.
LSH: bands = 32, rows = 4 (compatible with signature length).
Candidate retrieval: keep pairs with at least ceil(c * rows) collisions, where c is the value passed to -c.
Final acceptance in 2D mode requires jaccard_from_signatures() ≥ c.
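The numbers above can be made concrete with a small sketch. It is an illustration under assumptions, not the crate's implementation: min_collisions, split_bands and estimate_jaccard are hypothetical names (the last one is a generic bottom-k estimator, not the crate's jaccard_from_signatures()), signatures are assumed to be ascending lists of distinct u64 hashes, and the README does not specify how collisions are counted.

```rust
/// Collision threshold as stated in the README: ceil(c * rows).
fn min_collisions(c: f64, rows: usize) -> usize {
    (c * rows as f64).ceil() as usize
}

/// Splits a signature into at most `bands` consecutive bands of `rows`
/// values each. With the defaults, 128 values give 32 bands of 4.
fn split_bands(sig: &[u64], bands: usize, rows: usize) -> Vec<&[u64]> {
    if rows == 0 {
        return Vec::new();
    }
    sig.chunks_exact(rows).take(bands).collect()
}

/// Bottom-k Jaccard estimate: walk the merged, sorted union of both
/// signatures, look at its `sig_len` smallest values, and report the
/// fraction of those that occur in both inputs.
fn estimate_jaccard(a: &[u64], b: &[u64], sig_len: usize) -> f64 {
    let (mut i, mut j) = (0usize, 0usize);
    let (mut shared, mut seen) = (0usize, 0usize);
    while seen < sig_len && (i < a.len() || j < b.len()) {
        if j >= b.len() || (i < a.len() && a[i] < b[j]) {
            i += 1; // value only in `a`
        } else if i >= a.len() || b[j] < a[i] {
            j += 1; // value only in `b`
        } else {
            shared += 1; // value present in both
            i += 1;
            j += 1;
        }
        seen += 1;
    }
    if seen == 0 { 0.0 } else { shared as f64 / seen as f64 }
}
```

Under this reading, -c 0.9 with rows = 4 gives a threshold of 4 collisions and -c 0.5 gives 2; a pair would then be accepted when the estimate is at least c.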

Greedy clustering is CD-HIT-like:

Representatives are chosen in length-descending order (--sc/--sf affect output sorting only).
For 1-set modes, clustering runs over the entire set.
For 2D modes, set A is indexed; each sequence from B is assigned to its best-matching sequence in A if their Jaccard estimate is at least the -c value. Otherwise the B sequence forms a singleton cluster.
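The length-first greedy scheme for the 1-set modes can be sketched as follows. This is a simplified illustration, not the crate's code: greedy_cluster is a hypothetical name, the similarity callback stands in for the signature-based Jaccard estimate, every existing representative is checked directly instead of going through LSH candidate retrieval, and joining the first representative that passes the threshold is an assumption.

```rust
/// For each input sequence, returns the index of its cluster representative.
/// `lengths[i]` is the length of sequence `i`; `similarity(i, j)` is assumed
/// to return a value in [0, 1]; `c` plays the role of the -c threshold.
fn greedy_cluster<F>(lengths: &[usize], c: f64, similarity: F) -> Vec<usize>
where
    F: Fn(usize, usize) -> f64,
{
    // Visit sequences from longest to shortest (stable for equal lengths).
    let mut order: Vec<usize> = (0..lengths.len()).collect();
    order.sort_by(|&a, &b| lengths[b].cmp(&lengths[a]));

    let mut reps: Vec<usize> = Vec::new();
    let mut assignment = vec![usize::MAX; lengths.len()];
    for &idx in &order {
        // Join the first existing representative that is similar enough.
        let hit = reps.iter().copied().find(|&r| similarity(idx, r) >= c);
        match hit {
            Some(r) => assignment[idx] = r,
            None => {
                // No match: this sequence starts a new cluster.
                reps.push(idx);
                assignment[idx] = idx;
            }
        }
    }
    assignment
}
```

Because longer sequences are visited first, a representative is always at least as long as the members that join it later.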

.clstr output:

Written via a CD-HIT-compatible writer, with optional length annotations (nt/aa).
The first member in a cluster is the representative and ends with *.
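For readers unfamiliar with the format, here is a minimal sketch of a writer that produces protein-style .clstr text as emitted by upstream CD-HIT. The Member struct and format_clstr function are hypothetical and not taken from the crate; nucleotide output in upstream CD-HIT also carries strand information, which this sketch omits.

```rust
/// Hypothetical cluster member. `identity_pct` is `None` for the
/// representative and `Some(percent)` for every other member.
struct Member {
    name: String,
    len: usize,
    identity_pct: Option<f64>,
}

/// Renders clusters as .clstr text. `unit` is the length suffix,
/// "aa" for proteins or "nt" for nucleotides.
fn format_clstr(clusters: &[Vec<Member>], unit: &str) -> String {
    let mut out = String::new();
    for (cluster_id, members) in clusters.iter().enumerate() {
        out.push_str(&format!(">Cluster {}\n", cluster_id));
        for (i, m) in members.iter().enumerate() {
            // The representative is marked with '*'; others list identity.
            let tail = match m.identity_pct {
                None => "*".to_string(),
                Some(p) => format!("at {:.2}%", p),
            };
            out.push_str(&format!("{}\t{}{}, >{}... {}\n", i, m.len, unit, m.name, tail));
        }
    }
    out
}
```

A cluster with representative seqA (10 aa) and member seqB (8 aa, 95% identity) would be rendered as a ">Cluster 0" line followed by one tab-separated line per member.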

Input formats

FASTA, FASTQ (transparently supports .gz)
Multi-line FASTA/FASTQ supported
Robust error handling (skip malformed records, attempt resynchronization)
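The idea of skipping malformed records and resynchronizing at the next header can be sketched for plain, uncompressed FASTA text. The recovery rules below (ignore sequence lines that precede any valid header, drop records with an empty ID or an empty sequence) are assumptions for illustration; the README does not spell out the crate's exact rules, and parse_fasta is a hypothetical name.

```rust
/// Parses multi-line FASTA text into (id, sequence) pairs, skipping
/// records that look malformed instead of aborting.
fn parse_fasta(text: &str) -> Vec<(String, Vec<u8>)> {
    let mut records = Vec::new();
    let mut current: Option<(String, Vec<u8>)> = None;
    for line in text.lines() {
        let line = line.trim_end();
        if let Some(header) = line.strip_prefix('>') {
            // A new header closes the previous record, if it has a sequence.
            if let Some(rec) = current.take() {
                if !rec.1.is_empty() {
                    records.push(rec);
                }
            }
            // The ID is the first whitespace-delimited token of the header.
            let id = header.split_whitespace().next().unwrap_or("").to_string();
            current = if id.is_empty() { None } else { Some((id, Vec::new())) };
        } else if let Some(rec) = current.as_mut() {
            // Continuation line: append to the current sequence.
            rec.1.extend(line.bytes().filter(|b| !b.is_ascii_whitespace()));
        }
        // Lines outside any valid record are ignored until the next '>'.
    }
    if let Some(rec) = current {
        if !rec.1.is_empty() {
            records.push(rec);
        }
    }
    records
}
```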

Feature parity (v0.1)

Implemented end-to-end:

-i, -o, --i2 (2D), -c, -n, -T, -d
CD-HIT-like greedy clustering (length-first)
LSH candidate retrieval + MinHash/KMV Jaccard gate
.clstr writer compatibility
-aS / -aL coverage gates (basic support): when either is set > 0, the clusterer is configured with the corresponding coverage thresholds.
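As a reminder of what the coverage gates mean in upstream CD-HIT, -aS is the minimum fraction of the shorter sequence that must be covered and -aL the minimum fraction of the longer one. A minimal sketch of such a gate follows; passes_coverage is a hypothetical name, and how kira-cdh derives the covered length without a full alignment is not described in this README, so the aligned argument is an assumption.

```rust
/// Returns true when the covered length satisfies both coverage gates.
/// `a_s` and `a_l` are fractions in [0, 1]; 0.0 disables a gate.
fn passes_coverage(
    aligned: usize,
    shorter_len: usize,
    longer_len: usize,
    a_s: f64,
    a_l: f64,
) -> bool {
    if shorter_len == 0 || longer_len == 0 {
        return false; // nothing to cover
    }
    let cov_short = aligned as f64 / shorter_len as f64;
    let cov_long = aligned as f64 / longer_len as f64;
    cov_short >= a_s && cov_long >= a_l
}
```

For instance, 95 covered positions between a 100-residue and a 200-residue sequence pass -aS 0.9 but fail -aL 0.5, since only 47.5% of the longer sequence is covered.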

Parsed & validated (currently no-op or partial; accepted for CLI parity):

-M, -G, -b, -t, -s, -S, -A, -uS, -uL, -U
-p, --sf, --sc, --bak
Nucleotide scoring (--match, --mismatch, --gap, --gap-ext)
Paired-end hooks (-P, -j, --op, --cx, --cy, --ap, -r) — parsed; not all affect clustering yet

If you depend on the exact upstream semantics of a flag that is not listed under “Implemented end-to-end”, please open an issue. The plan is to add strict fail-fast checks for unsupported semantics in a subsequent minor release.

Performance knobs

Threads: -T (default: 1; 0 = all CPUs).
k-mer size: -n (defaults: protein=5, nucleotide=10).
Identity threshold: -c influences LSH and final Jaccard acceptance.
Signature length: currently fixed to 128 (32×4); future releases may expose this.
Memory: -M is advisory in v0.1 (no strict cgroup/pid limit). Indexing is streaming; memory depends mainly on signature storage and LSH buckets.
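A rough way to reason about memory is to estimate signature storage and LSH entries from the dataset size. The arithmetic below is a back-of-the-envelope sketch: the 8-byte hash width and the assumption of one bucket entry per band per sequence are not stated in this README, and both function names are hypothetical.

```rust
/// Bytes needed to store all signatures: sequences x values x bytes per value.
fn signature_bytes(n_seqs: u64, sig_len: u64, hash_bytes: u64) -> u64 {
    n_seqs * sig_len * hash_bytes
}

/// Number of LSH bucket entries, assuming one entry per band per sequence.
fn lsh_entry_count(n_seqs: u64, bands: u64) -> u64 {
    n_seqs * bands
}
```

Under these assumptions, one million sequences with 128-value signatures need about 1.02 GB for signatures alone, plus 32 million LSH bucket entries.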

Examples

Cluster proteins at 90% identity:

kira-cdh --mode cd-hit \
  -i uniprot_sprot.fasta \
  -o sprot90 \
  -c 0.90 -n 5 -T 32

Cluster reads at 97% identity (nucleotide):

kira-cdh --mode cd-hit-est \
  -i reads.fasta.gz \
  -o reads97 \
  -c 0.97 -n 10 -T 16

Compare B against A (protein 2D):

kira-cdh --mode cd-hit-2d \
  -i A.faa --i2 B.faa \
  -o B_vs_A \
  -c 0.9 -n 5

The resulting B_vs_A.clstr contains A-anchored clusters for matches and singleton clusters for unmatched B sequences.

Logging

Set RUST_LOG to tune verbosity:

RUST_LOG=info kira-cdh --mode cd-hit -i input.fa -o out -c 0.9
RUST_LOG=debug kira-cdh --mode cd-hit-est -i input.fq.gz -o out -c 0.97

Contributing

Follow KISS/DRY principles; prefer small, well-documented modules.
For performance work, include before/after benchmarks and dataset notes.
When wiring a new CD-HIT flag, update README.md (Feature parity) and add a validation path.

License

GPLv2.

Acknowledgements

We would like to thank the original authors and maintainers of CD-HIT for their contributions to the field of sequence clustering, which served as an inspiration for this project.
