| Crates.io | kira-cdh |
| lib.rs | kira-cdh |
| version | 0.1.1 |
| created_at | 2025-08-17 22:59:34.49659+00 |
| updated_at | 2025-08-18 02:25:54.360725+00 |
| description | Single-binary, CLI-compatible replacement for CD-HIT utilities (cd-hit, cd-hit-est, cd-hit-2d, cd-hit-est-2d) in Rust. |
| homepage | |
| repository | https://github.com/ARyaskov/kira-cdh |
| max_upload_size | |
| id | 1799761 |
| size | 154,000 |
kira-cdh is a single-binary, CLI-compatible replacement for the core CD-HIT utilities:
- --mode cd-hit: protein clustering
- --mode cd-hit-est: nucleotide clustering
- --mode cd-hit-2d: two-dataset protein comparison
- --mode cd-hit-est-2d: two-dataset nucleotide comparison

It accepts the same flags as the original tools. The pipeline is implemented in Rust (edition 2024) using a modular stack:
FASTX I/O → k-mer hashing → KMV/MinHash signatures → LSH candidate retrieval → greedy clustering → CD-HIT compatible .clstr writer.
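To make the middle of that pipeline concrete, here is a minimal sketch of k-mer hashing followed by a MinHash signature and a signature-based Jaccard estimate. It is an illustration only: the function names (`seeded_hash`, `minhash_signature`, `estimate_jaccard`), the use of the standard library's `DefaultHasher`, and the seed-per-slot scheme are assumptions made for this example and are not taken from the kira-cdh source.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hash one k-mer together with a seed; each seed acts as a separate hash function.
/// (Hypothetical helper, not part of kira-cdh.)
fn seeded_hash(kmer: &[u8], seed: u64) -> u64 {
    let mut h = DefaultHasher::new();
    seed.hash(&mut h);
    kmer.hash(&mut h);
    h.finish()
}

/// MinHash signature: for each seed, keep the smallest hash over all k-mers.
/// Returns None when the sequence is shorter than k (no k-mers to hash).
fn minhash_signature(seq: &[u8], k: usize, num_hashes: usize) -> Option<Vec<u64>> {
    if k == 0 || seq.len() < k {
        return None;
    }
    let sig = (0..num_hashes as u64)
        .map(|seed| {
            seq.windows(k)
                .map(|kmer| seeded_hash(kmer, seed))
                .min()
                .expect("at least one k-mer exists")
        })
        .collect();
    Some(sig)
}

/// Estimated Jaccard similarity: the fraction of signature slots that agree.
fn estimate_jaccard(a: &[u64], b: &[u64]) -> f64 {
    assert_eq!(a.len(), b.len(), "signatures must have equal length");
    if a.is_empty() {
        return 0.0;
    }
    let matches = a.iter().zip(b).filter(|(x, y)| x == y).count();
    matches as f64 / a.len() as f64
}
```

Two identical sequences produce identical signatures and an estimate of 1.0. Sequences that share no k-mers agree only on accidental hash collisions, so their estimate stays near 0.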
Status: v0.1 focuses on the 4 core modes above. Additional CD-HIT variants (e.g. PSI-CD-HIT, 454/OTU/LAP/DUP) can be added later via the same --mode mechanism.
- --mode switch, drop-in CLI flag compatibility.
- FASTA/FASTQ input (including .gz) with robust, CD-HIT-like error handling.
- .clstr output compatibility.

# Requires Rust stable (MSRV = 1.85)
git clone https://github.com/ARyaskov/kira-cdh
cd kira-cdh
cargo install --path .
Protein clustering (cd-hit):
kira-cdh --mode cd-hit \
-i proteins.fasta -o clusters -c 0.9 -n 5 -T 16
Nucleotide clustering (cd-hit-est):
kira-cdh --mode cd-hit-est \
-i reads.fasta.gz -o clusters -c 0.97 -n 10 -T 16
Two-dataset comparison (protein):
kira-cdh --mode cd-hit-2d \
-i proteinsA.fasta \
--i2 proteinsB.fasta \
-o B_vs_A -c 0.9 -n 5 -T 16
Two-dataset comparison (nucleotide):
kira-cdh --mode cd-hit-est-2d \
-i readsA.fasta.gz \
--i2 readsB.fasta.gz \
-o B_vs_A -c 0.97 -n 10 -T 16
Outputs:
- <prefix>: FASTA with cluster representatives
- <prefix>.clstr: CD-HIT-compatible cluster file

The tool exposes the same flags as the original utilities for the selected mode. Run:
kira-cdh --mode <cd-hit|cd-hit-est|cd-hit-2d|cd-hit-est-2d> --help
- -i <file>: input (FASTA/FASTQ; .gz supported)
- -o <prefix>: output prefix
- -c <float>: identity threshold [0..1] (used for MinHash Jaccard gate)
- -n <int>: word length (k-mer size). Defaults: protein=5, nucleotide=10
- -T <int>: threads (0 = all CPUs)
- -M <int>: memory limit (advisory)
- -d <int>: description length in output FASTA (0 = full)
- -aS, -aL, -A, -s, -S, -uS, -uL, -U
- --match, --mismatch, --gap, --gap-ext
- --sf, --sc, --bak, -p
- --i2 <file>: second input (required)
- --s2, --S2 (length-diff gates for db1)
- -P 1 -j <R2.fastq> --op <out_R2>: paired-end passthrough hooks

Note: For v0.1, only the flags that affect the LSH/Jaccard/greedy stages are functionally active (see details below). Other flags are parsed and validated, but may be no-ops at this stage; see Feature parity.
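As a rough illustration of the LSH stage named in the note above, the sketch below splits a 128-slot signature into bands, files each band under a bucket key, and counts how many bands a query shares with each indexed sequence. The 32 x 4 layout mirrors the band and row values this README gives for identity gating; `LshIndex`, `band_keys`, and the bucket hashing are hypothetical and are not taken from the kira-cdh source.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Assumed layout: 32 bands of 4 rows over a 128-slot signature.
const BANDS: usize = 32;
const ROWS: usize = 4;

/// One bucket key per band, derived from that band's slice of the signature.
fn band_keys(signature: &[u64]) -> Vec<(usize, u64)> {
    assert_eq!(signature.len(), BANDS * ROWS, "signature length must equal bands * rows");
    signature
        .chunks(ROWS)
        .enumerate()
        .map(|(band, rows)| {
            let mut h = DefaultHasher::new();
            rows.hash(&mut h);
            (band, h.finish())
        })
        .collect()
}

/// Hypothetical index: (band, key) -> ids of sequences that fell into that bucket.
#[derive(Default)]
struct LshIndex {
    buckets: HashMap<(usize, u64), Vec<usize>>,
}

impl LshIndex {
    /// File a sequence id under every one of its band buckets.
    fn insert(&mut self, id: usize, signature: &[u64]) {
        for key in band_keys(signature) {
            self.buckets.entry(key).or_default().push(id);
        }
    }

    /// Ids that share at least one band bucket with the query,
    /// mapped to the number of colliding bands.
    fn candidates(&self, signature: &[u64]) -> HashMap<usize, usize> {
        let mut hits = HashMap::new();
        for key in band_keys(signature) {
            if let Some(ids) = self.buckets.get(&key) {
                for &id in ids {
                    *hits.entry(id).or_insert(0) += 1;
                }
            }
        }
        hits
    }
}
```

A caller could apply a minimum collision count to the returned map before running the more expensive final Jaccard check on the surviving candidates.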
Internally, identity gating uses MinHash/KMV signatures and LSH:
- LSH parameters: bands = 32, rows = 4 (compatible with signature length).
- LSH candidate gate: ceil(c * rows) collisions, where c is the -c value.
- Final acceptance: jaccard_from_signatures() ≥ -c.

Greedy clustering is CD-HIT-like:
- --sc/--sf affect output sorting only.
- In 2D modes, a B sequence joins an A-anchored cluster when Jaccard ≥ -c. Otherwise the B sequence forms a singleton cluster.

.clstr output:
- [...] nt/aa).
- [...] *.
- [...] .gz)

Implemented end-to-end:
- -i, -o, --i2 (2D), -c, -n, -T, -d
- .clstr writer compatibility
- -aS / -aL coverage gates (basic support). When either is set > 0, the clusterer is configured with corresponding coverage thresholds.

Parsed & validated (currently no-op or partial; accepted for CLI parity):
- -M, -G, -b, -t, -s, -S, -A, -uS, -uL, -U
- -p, --sf, --sc, --bak
- Alignment scoring (--match, --mismatch, --gap, --gap-ext)
- Remaining flags (-P, -j, --op, --cx, --cy, --ap, -r): parsed; not all affect clustering yet

If you depend on the exact upstream semantics of a flag that is not listed under "Implemented end-to-end", please open an issue. The plan is to add strict fail-fast checks for unsupported semantics in a subsequent minor release.
- Threads: -T (default: 1; 0 = all CPUs).
- Word length: -n (defaults: protein=5, nucleotide=10).
- -c influences LSH and final Jaccard acceptance.
- -M is advisory in v0.1 (no strict cgroup/pid limit). Indexing is streaming; memory depends mainly on signature storage and LSH buckets.

Cluster proteins at 90% identity:
kira-cdh --mode cd-hit \
-i uniprot_sprot.fasta \
-o sprot90 \
-c 0.90 -n 5 -T 32
Cluster reads at 97% identity (nucleotide):
kira-cdh --mode cd-hit-est \
-i reads.fasta.gz \
-o reads97 \
-c 0.97 -n 10 -T 16
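The .clstr file written by a run like the one above can be post-processed with a few lines of code. The sketch below assumes the conventional CD-HIT layout (a `>Cluster N` header followed by member lines such as `0	150nt, >read1... *`), which this README says the writer is compatible with. `Member`, `parse_clstr`, and `parse_member` are names invented for this example.

```rust
/// One member line of a .clstr file (hypothetical type for this example).
#[derive(Debug, PartialEq)]
struct Member {
    name: String,
    length: usize,
    is_representative: bool,
}

/// Parse .clstr text into clusters; malformed member lines are skipped.
fn parse_clstr(text: &str) -> Vec<Vec<Member>> {
    let mut clusters: Vec<Vec<Member>> = Vec::new();
    for line in text.lines() {
        let line = line.trim();
        if line.starts_with(">Cluster") {
            // A header line opens a new, initially empty cluster.
            clusters.push(Vec::new());
        } else if let (Some(member), Some(current)) = (parse_member(line), clusters.last_mut()) {
            current.push(member);
        }
    }
    clusters
}

/// Parse a member line such as `0<TAB>2799aa, >seqA... *`.
fn parse_member(line: &str) -> Option<Member> {
    // Everything before the first comma is "<index> <length><unit>".
    let (head, rest) = line.split_once(',')?;
    let length_field = head.split_whitespace().nth(1)?;
    let digits: String = length_field
        .chars()
        .take_while(|c| c.is_ascii_digit())
        .collect();
    let length = digits.parse::<usize>().ok()?;
    // The name sits between '>' and the "..." that follows it.
    let after_gt = rest.split_once('>')?.1;
    let (name, tail) = after_gt.split_once("...")?;
    Some(Member {
        name: name.to_string(),
        length,
        // Representatives end with "*"; other members end with "at <identity>%".
        is_representative: tail.trim() == "*",
    })
}
```

Calling `parse_clstr` on the contents of reads97.clstr would yield one `Vec<Member>` per cluster, which is enough to compute cluster sizes or list representatives. Names that themselves contain `...` would be cut short by this simple split.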
Compare B against A (protein 2D):
kira-cdh --mode cd-hit-2d \
-i A.faa -i2 B.faa \
-o B_vs_A \
-c 0.9 -n 5
The resulting B_vs_A.clstr contains A-anchored clusters for matches and singleton clusters for unmatched B sequences.
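The A-anchored versus singleton outcome described above can be sketched as a simple assignment loop. This example swaps in an exact Jaccard similarity over k-mer sets as a stand-in for the MinHash estimate and skips LSH candidate retrieval entirely. `kmer_set`, `jaccard`, and `assign_b_to_a` are hypothetical names, and picking the first matching A sequence is an assumption, not a documented behavior of kira-cdh.

```rust
use std::collections::HashSet;

/// All distinct k-mers of a sequence (empty when the sequence is shorter than k).
fn kmer_set(seq: &[u8], k: usize) -> HashSet<&[u8]> {
    if k == 0 || seq.len() < k {
        return HashSet::new();
    }
    seq.windows(k).collect()
}

/// Exact Jaccard similarity of two k-mer sets (stand-in for the MinHash estimate).
fn jaccard<'s>(a: &HashSet<&'s [u8]>, b: &HashSet<&'s [u8]>) -> f64 {
    let union = a.union(b).count();
    if union == 0 {
        return 0.0;
    }
    a.intersection(b).count() as f64 / union as f64
}

/// For each B sequence, return the index of the first A sequence whose
/// similarity reaches `threshold`, or None when B would form a singleton cluster.
fn assign_b_to_a(
    a_seqs: &[&[u8]],
    b_seqs: &[&[u8]],
    k: usize,
    threshold: f64,
) -> Vec<Option<usize>> {
    let a_sets: Vec<_> = a_seqs.iter().map(|&s| kmer_set(s, k)).collect();
    b_seqs
        .iter()
        .map(|&b| {
            let b_set = kmer_set(b, k);
            a_sets
                .iter()
                .position(|a_set| jaccard(a_set, &b_set) >= threshold)
        })
        .collect()
}
```

With `threshold` set to the -c value, a `Some(i)` entry corresponds to a B sequence joining the cluster anchored by the i-th A sequence, and a `None` entry corresponds to a singleton.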
Set RUST_LOG to tune verbosity:
RUST_LOG=info kira-cdh --mode cd-hit -i input.fa -o out -c 0.9
RUST_LOG=debug kira-cdh --mode cd-hit-est -i input.fq.gz -o out -c 0.97
[...] README.md (Feature parity) and add a validation path.

License: GPLv2.
We would like to thank the original authors and maintainers of CD-HIT for their contributions to the field of sequence clustering, which served as an inspiration for this project.