| Crates.io | kira_cdh_compat_lsh |
| lib.rs | kira_cdh_compat_lsh |
| version | 0.1.0 |
| created_at | 2025-08-16 02:25:03.903953+00 |
| updated_at | 2025-08-16 02:25:03.903953+00 |
| description | Candidate search (LSH/MinHash & KMV) for high-identity sequence clustering pipelines compatible with CD-HIT outputs. |
| homepage | |
| repository | https://github.com/ARyaskov/kira_cdh_compat_lsh |
| max_upload_size | |
| id | 1797780 |
| size | 66,562 |
Candidate search (LSH - Locality-Sensitive Hashing) primitives for CD-HIT–style clustering pipelines.
This crate focuses on fast candidate retrieval between sequences represented as sets of pre-hashed k-mers (u64). It provides:
parallel, enabled by default)Out of scope: FASTA/FASTQ parsing and
.clstrwriting (handled by other crates in our stack).
Goal: plug this crate between your k-mer hasher and your greedy clusterer.
[dependencies]
kira_cdh_compat_lsh = "*"
# Optional features (recommended):
# default = ["parallel"]
parallel (default) — enables rayon for parallel build/queryserde — enables serde/serde_bytes for signature (de)serialization helpers in client codeMinimum supported Rust version (MSRV): 1.85.
use kira_cdh_compat_lsh::{
kmv::KmvSketch,
lsh::{LshIndex, LshParams},
sketch::jaccard_from_signatures,
};
// 1) You already have u64 k-mer hashes for each sequence:
let hashes_a: Vec<u64> = /* ... */ vec![11,12,13,100,101,102,1000,2000];
let hashes_b: Vec<u64> = /* ... */ vec![12,13,14,101,102,103,1000,3000];
// 2) Build compact signatures (KMV shown; MinHash also available):
let mut sa = KmvSketch::new(128); // keep k=128 minima
for h in hashes_a { sa.update(h); }
let sig_a = sa.finish(); // sorted ascending
let mut sb = KmvSketch::new(128);
for h in hashes_b { sb.update(h); }
let sig_b = sb.finish();
// 3) Build an LSH index and insert signatures
let params = LshParams::new(32, 4).unwrap(); // 32 bands × 4 rows = 128
let mut index = LshIndex::with_params(params);
index.insert(0, &sig_a).unwrap();
index.insert(1, &sig_b).unwrap();
index.build();
// 4) Query candidates for A
let cands = index.query_candidates(&sig_a, 1); // (id, collisions)
assert!(cands.iter().any(|(id, _)| *id == 1));
// 5) Optional: estimate Jaccard from two signatures
let j_est = jaccard_from_signatures(&sig_a, &sig_b);
eprintln!("Estimated Jaccard(A,B) ~ {j_est:.3}");
MinHash: classical K-permutation min estimator. More arithmetic, best-known estimator for Jaccard.
API: minhash::MinHash::new(num_hashes, seed0).update(u64).finish() -> Vec<u64>
KMV (bottom-k): keep the k smallest hashed values from a single hash stream; usually faster than MinHash and works very well as a candidate generator.
API: kmv::KmvSketch::new(k).update(u64).finish() -> Vec<u64> (sorted)
Both outputs are
Vec<u64>signatures that feed directly into LSH banding.
Split a signature (length bands * rows) into bands chunks of rows each.
Each chunk is folded into a deterministic 64-bit band key and placed in a bucket.
Two sequences are candidates if they share ≥ min_collisions band keys.
API:
let params = LshParams::new(bands, rows_per_band)?;
let mut index = LshIndex::with_params(params);
index.insert(seq_id, &signature)?;
index.build(); // optional finalize
let hits: Vec<(u32, u32)> = index.query_candidates(&signature, min_collisions);
Signature length: bands × rows. Typical values: 128–256 for high-identity clustering.
Heuristic for high identity (CD-HIT-like):
min_collisions to tighten precision for higher identity thresholds.KMV vs MinHash:
Mapping nucleotide/protein identity → Jaccard over k-mers depends on k and error model. Treat
min_collisionsas a tunable gate before the expensive, exact stage.
splitmix64-based fold (platform-independent).LshIndex uses FxHasher internally for fast deterministic maps.When parallel is enabled (default), building large indexes and batch querying can be parallelized in higher-level code
(e.g., producing signatures in parallel and calling insert from worker threads). The internal data structures
in this crate are optimized for fast single-threaded insert/query; batching at a higher level typically yields better scalability.
FASTA/FASTQ ──► k-mer hashing (u64) ──► (KMV | MinHash) signature ──► LSH candidates
│
└─► Greedy clusterer (exact check, identity/coverage)
└─► .clstr writer (CD-HIT-compatible)
This crate implements only the bold path: signature construction and candidate retrieval.
It intentionally does not depend on I/O or .clstr formats.
kmv::KmvSketch — KMV (bottom-k) signaturesminhash::MinHash — classical MinHash signatureslsh::{LshParams, LshIndex} — banding & candidate retrievalsketch::jaccard_from_signatures(a, b) — simple similarity estimate on same-length signaturesRun locally:
cargo bench --all-features
Tips:
Vec<u32> per band key. A compact arena layout (single Vec<u32> with offsets) can cut memory and speed up scans. Planned as a non-breaking internal change.min_collisions are workload-dependent; expose your tuning knobs in the higher layer that knows k, alphabet, and identity threshold.Contributions are welcome. Please run cargo fmt, cargo clippy -D warnings, and add tests for new behavior.
GPLv2