| Crates.io | kira_cdh_compat_kmer_indexer |
| lib.rs | kira_cdh_compat_kmer_indexer |
| version | 0.2.0 |
| created_at | 2025-08-16 19:46:31.550039+00 |
| updated_at | 2025-08-16 19:46:31.550039+00 |
| description | CD-HIT-compatible k-mer indexing (CD-HIT-NG) in Rust: fast, memory-efficient, mmap-ready. |
| homepage | |
| repository | https://github.com/ARyaskov/kira_cdh_compat_kmer_indexer |
| max_upload_size | |
| id | 1798804 |
| size | 124,929 |
CD-HIT-compatible k-mer indexing (“CD-HIT-NG”) for FASTQ in modern Rust (edition 2024).
- Canonical k-mers (`min(fwd, rc)`)
- Windows containing `N` are skipped (rolling state resets)
- MSB-aligned `u64` codes
- `.kdx` file, mmap-friendly; lazy decode for compressed postings

CD-HIT compatibility: encoding, canonicalization and N-handling match CD-HIT semantics so the index can drive a CD-HIT-style pipeline. The `.kdx` file format is this crate's own (compact and mmap-oriented).
```shell
# Build CLI
cargo build --release

# Create a k-mer index (31-mers, 4096 buckets, canonicalized)
./target/release/kdx-build \
  --input reads.fastq.gz \
  --output reads.kdx \
  --k 31 \
  --bucket-bits 12
```
Reopen and query in your own code:
```rust
use kira_cdh_compat_kmer_indexer as kdx;
use std::path::Path;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let idx = kdx::KmerIndex::open_mmap(Path::new("reads.kdx"))?;

    let code = {
        // 31-mer MSB-aligned, canonicalized
        let lsb = kdx::encode::encode_kmer(b"ACGTACGTACGTACGTACGTACGTACGTACG").unwrap();
        kdx::encode::canonical(lsb, 31)
    };

    // `range` is opaque; pass it to `postings()` unchanged.
    let range = idx.locate(code);
    for p in idx.postings(range) {
        // p.read_id (u32), p.pos (u32)
    }
    Ok(())
}
```
```toml
# Cargo.toml
[dependencies]
kira_cdh_compat_kmer_indexer = "*"
```
Feature flags
- `mmap` (default): enables memmap2 + bytemuck for mmap reads
- `async`: enables the async builder (requires tokio)
- `simd_x86`: enables optional AVX2 path for ASCII→2-bit mapping (auto-fallback)

Build an index synchronously:

```rust
use kira_cdh_compat_kmer_indexer::*;
use kira_cdh_compat_kmer_indexer::io::Compression;
use kira_cdh_compat_kmer_indexer::index::LocateStrategy;
use std::path::Path;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let idx = build_kmer_index_sync(
        Path::new("reads.fastq.gz"),
        ReaderOptions::default(),
        31,
        BuildConfig::default()
            .with_bucket_bits(12) // B ∈ [10..14] typical
            .canonical(true)      // default true
            .with_heads(true)     // required when compression != None
            .compression(Compression::DeltaVarint)
            .locate_strategy(LocateStrategy::PlrSampling)
            .plr_stride(256),
    )?;

    // Optional: serialize to disk (v2 format)
    KmerIndexWriter::new(&idx).write_to(Path::new("reads.kdx"))?;
    Ok(())
}
```
Enable the async feature and use:
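```rust
use kira_cdh_compat_kmer_indexer::*;
use std::path::Path;

// Requires the `async` feature; call from within an async context (e.g. a tokio runtime).
let idx = build_kmer_index_async(
    Path::new("reads.fastq.gz"),
    ReaderOptions::default(),
    31,
    BuildConfig::default(),
)
.await?;
```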
Query an existing index:

```rust
let kdx = KmerIndex::open_mmap(Path::new("reads.kdx"))?;
let msb_code = encode::canonical(encode::encode_kmer(b"ACGT...window").unwrap(), 31);
let range = kdx.locate(msb_code); // opaque; pass it unchanged
let posts = kdx.postings(range);  // &[Posting { read_id, pos }]
```

`range` is opaque (internally tagged with bucket id). Always pass it unchanged to `postings()`.
```text
kdx-build --input <FASTQ[.gz]> --output <reads.kdx> --k <1..=32>
          [--bucket-bits B] [--no-canonical] [--heads]
          [--compression {none|dv}] [--locate {bin|plr}] [--plr-stride N]
          [--min-read-len N] [--threads N]
```
Recommended presets
- `--k 31 --bucket-bits 12 --locate plr --plr-stride 256`
- `--compression dv --heads` (heads are required by `dv`)

K-mer codes are stored as `u64` (MSB-aligned). This makes bucket extraction trivial and keeps radix passes cache-friendly.

Per base, where `v` is the 2-bit code of the incoming base and `mask` keeps the low `2k` bits:

```text
fwd  = ((fwd << 2) | v) & mask
rc   = (rc >> 2) | ((v ^ 0b11) << (2*(k-1)))
code = min(fwd, rc) << (64 - 2*k)
```

Ambiguous base (IUPAC/N) → reset `fwd`, `rc`, and window length.
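The per-base update above can be sketched as a small standalone function. This is an illustration of the recurrence, not the crate's implementation: `base_code` and `canonical_kmers` are hypothetical names, and the mapping A=0, C=1, G=2, T=3 (with lowercase accepted) is an assumption chosen so that `v ^ 0b11` yields the complement.

```rust
/// Map an ASCII base to a 2-bit value. Assumed mapping: A=0, C=1, G=2, T=3.
/// Anything else (N, IUPAC codes) is treated as ambiguous.
fn base_code(b: u8) -> Option<u64> {
    match b {
        b'A' | b'a' => Some(0),
        b'C' | b'c' => Some(1),
        b'G' | b'g' => Some(2),
        b'T' | b't' => Some(3),
        _ => None,
    }
}

/// Return (window start, MSB-aligned canonical code) for every valid k-mer.
fn canonical_kmers(seq: &[u8], k: u32) -> Vec<(usize, u64)> {
    assert!((1..=32).contains(&k), "k must be in 1..=32");
    // Keep only the low 2k bits of the forward code.
    let mask = if k == 32 { u64::MAX } else { (1u64 << (2 * k)) - 1 };
    let (mut fwd, mut rc, mut len) = (0u64, 0u64, 0u32);
    let mut out = Vec::new();
    for (i, &b) in seq.iter().enumerate() {
        match base_code(b) {
            Some(v) => {
                // Forward strand: append the new base at the low end.
                fwd = ((fwd << 2) | v) & mask;
                // Reverse complement: push the complemented base in at the high end.
                rc = (rc >> 2) | ((v ^ 0b11) << (2 * (k - 1)));
                len += 1;
                if len >= k {
                    let code = fwd.min(rc) << (64 - 2 * k);
                    out.push((i + 1 - k as usize, code));
                }
            }
            None => {
                // Ambiguous base: reset the rolling state and window length.
                fwd = 0;
                rc = 0;
                len = 0;
            }
        }
    }
    out
}
```

Each base costs a constant number of shifts and masks, and a window that overlaps an ambiguous base never produces a code because `len` has to climb back to `k` first.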
Bucketing and locate:

- Bucketing: the requested `B` is clamped to `min(14, 2k)` to avoid degenerate bucketing for small `k`.
- Binary search: the lookup searches within `codes[bucket]`.
- PLR sampling: the index keeps sampled `(key, pos)` pairs; at query time, upper-bound in the samples and binary-search only within a narrow window. Typical stride 256–512.

`Compression::DeltaVarint`:

- Requires `heads[]` (first index of each unique code run).
- Postings are sorted by `(read_id, pos)` and encoded as LEB128: first pair absolute, subsequent as deltas.
- Buckets are decoded lazily on first `postings()` access and cached in memory.

Performance tips:

- Build with `--release` and `RUSTFLAGS="-C target-cpu=native"`.
- Use `--threads` to cap parallelism or let Rayon use logical cores.
- `B=12` (4096 buckets) is a solid default for `k=31`.

File format (`.kdx`)

Version 2 (mmap-friendly, supports compression/samples). All integers are little-endian; payload aligned to 8 bytes.
| Field | Type | Notes |
|---|---|---|
| `magic` | u32 | "KDX1" |
| `version` | u32 | 2 |
| `k` | u16 | k-mer length |
| `canonical` | u8 | 0/1 |
| `bucket_bits` | u8 | B |
| `buckets` | u32 | 1 << B |
| `total_entries` | u64 | sum of all postings |
| `dir_offset` | u64 | byte offset of the directory |
| `compression` | u8 | 0=None, 1=DeltaVarint |
| `locate_strategy` | u8 | 0=BinarySearch, 1=PlrSampling |
| `reserved0` | u16 | 0 |
| `reserved1` | u32 | 0 |
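For readers who want to inspect a `.kdx` file without the crate, the header fields can be read by hand. The sketch below rests on assumptions the table does not confirm: that the header starts at byte 0, that fields are packed in table order with no padding (40 bytes in total), and that the magic appears as the ASCII bytes `KDX1`. `KdxHeader` and `parse_header` are hypothetical names, not part of the crate's API.

```rust
use std::convert::TryInto;

/// Header fields in table order (reserved fields are skipped).
#[derive(Debug, PartialEq)]
struct KdxHeader {
    version: u32,
    k: u16,
    canonical: bool,
    bucket_bits: u8,
    buckets: u32,
    total_entries: u64,
    dir_offset: u64,
    compression: u8,
    locate_strategy: u8,
}

/// Assumed packed size: 4+4+2+1+1+4+8+8+1+1+2+4 bytes.
const HEADER_LEN: usize = 40;

fn parse_header(buf: &[u8]) -> Result<KdxHeader, String> {
    if buf.len() < HEADER_LEN {
        return Err(format!("header needs {} bytes, got {}", HEADER_LEN, buf.len()));
    }
    // Assumption: the magic is stored as the raw ASCII bytes "KDX1".
    if &buf[0..4] != b"KDX1" {
        return Err("bad magic".to_string());
    }
    // Little-endian readers at fixed offsets (bounds were checked above).
    let u16_at = |o: usize| u16::from_le_bytes(buf[o..o + 2].try_into().unwrap());
    let u32_at = |o: usize| u32::from_le_bytes(buf[o..o + 4].try_into().unwrap());
    let u64_at = |o: usize| u64::from_le_bytes(buf[o..o + 8].try_into().unwrap());

    let header = KdxHeader {
        version: u32_at(4),
        k: u16_at(8),
        canonical: buf[10] != 0,
        bucket_bits: buf[11],
        buckets: u32_at(12),
        total_entries: u64_at(16),
        dir_offset: u64_at(24),
        compression: buf[32],
        locate_strategy: buf[33],
    };
    // Sanity check from the table: buckets == 1 << B.
    if 1u32.checked_shl(header.bucket_bits as u32) != Some(header.buckets) {
        return Err("buckets does not equal 1 << bucket_bits".to_string());
    }
    Ok(header)
}
```

In normal use `KmerIndex::open_mmap` does this work; a hand parser is mainly useful for debugging or for tooling in other languages.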
Directory fields:

| Field | Type | Meaning |
|---|---|---|
| `off_codes` | u64 | byte offset of `codes[]` (u64 MSB-aligned keys) |
| `len_codes` | u64 | number of u64 codes |
| `off_posts` | u64 | if compressed: byte offset of encoded postings; else: array offset |
| `len_posts` | u64 | if compressed: byte length; else: number of `Posting` entries |
| `total_posts` | u64 | total postings count (used when compressed) |
| `off_heads` | u64 | byte offset of `heads[]` (u32), or 0 |
| `len_heads` | u64 | number of u32 head entries, or 0 |
| `off_samples` | u64 | byte offset of PLR sample pairs (u64 key, u64 pos), or 0 |
| `len_samples` | u64 | number of sample pairs, or 0 |
The crate can still open v1 files (no compression/samples) but writes v2.
Testing:

- Unit tests (including `.kdx` handling): `cargo test`
- Property tests (`proptest`): included for rolling encode consistency
- Benchmarks: `cargo bench` (enable in your workspace as needed)

Configuration options:

| Option | Default | Notes |
|---|---|---|
| `k` | required | ≤ 32 (u64). k>32 would need u128 or two u64 (future option). |
| `bucket_bits` | 12 | Clamped to min(14, 2k) |
| `canonical` | true | Rolling O(1) implementation |
| `heads` | false | Required if compression=DeltaVarint |
| `compression` | None | DeltaVarint reduces on-disk size; lazy per-bucket decode |
| `locate_strategy` | Binary | PlrSampling reduces locate time |
| `plr_stride` | 256 | 256–512 is a good starting range |
| `min_read_len` | 0 | Skip too-short reads early |
| `threads` (sync build) | auto | Override to pin build parallelism |
Q: Why MSB-aligned codes? A: Faster bucket extraction and cache-friendly radix passes. The alignment is internal; API accepts/returns MSB-aligned codes for lookups.
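To make the bucket-extraction point concrete, here is a minimal sketch. It assumes the bucket id is simply the top `B` bits of the MSB-aligned code and that "clamped to min(14, 2k)" acts as an upper bound on the requested value; `effective_bucket_bits` and `bucket_of` are hypothetical helpers, not the crate's API.

```rust
/// Upper-bound the requested bucket bits by min(14, 2k).
fn effective_bucket_bits(requested: u32, k: u32) -> u32 {
    requested.min(14).min(2 * k)
}

/// Bucket id taken from the top `bucket_bits` bits of an MSB-aligned code.
fn bucket_of(code: u64, bucket_bits: u32) -> usize {
    assert!(bucket_bits <= 32, "bucket_bits out of range for this sketch");
    if bucket_bits == 0 {
        0
    } else {
        (code >> (64 - bucket_bits)) as usize
    }
}
```

With this layout one right shift yields the bucket, and codes that sort together numerically also land in neighbouring buckets.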
Q: Is the index stable across runs? A: Yes. Build is deterministic for identical inputs and config.
Q: Do I need heads[]?
A: Not for uncompressed postings. For DeltaVarint compression, heads[] is required to delineate code runs during (de)compression.
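As a rough illustration of delta plus LEB128 encoding for one run of postings, consider the sketch below. The exact wire layout used by the crate is not documented here, so the delta rule is an assumption: `read_id` is stored as a difference from the previous posting, and `pos` is stored as a difference only when `read_id` is unchanged (otherwise absolute). All function names are hypothetical.

```rust
use std::convert::TryFrom;

/// Append `v` as unsigned LEB128 (7 bits per byte, high bit = continuation).
fn put_uleb128(mut v: u64, out: &mut Vec<u8>) {
    loop {
        let byte = (v & 0x7f) as u8;
        v >>= 7;
        if v == 0 {
            out.push(byte);
            return;
        }
        out.push(byte | 0x80);
    }
}

/// Read one unsigned LEB128 value starting at `*at`, advancing the cursor.
fn get_uleb128(buf: &[u8], at: &mut usize) -> Option<u64> {
    let (mut v, mut shift) = (0u64, 0u32);
    loop {
        let byte = *buf.get(*at)?;
        *at += 1;
        if shift >= 64 {
            return None; // value does not fit in 64 bits
        }
        v |= ((byte & 0x7f) as u64) << shift;
        if byte & 0x80 == 0 {
            return Some(v);
        }
        shift += 7;
    }
}

/// Encode postings sorted by (read_id, pos): first pair absolute, then deltas.
fn encode_postings(posts: &[(u32, u32)]) -> Vec<u8> {
    let mut out = Vec::new();
    let mut prev: Option<(u32, u32)> = None;
    for &(read_id, pos) in posts {
        match prev {
            None => {
                put_uleb128(read_id as u64, &mut out);
                put_uleb128(pos as u64, &mut out);
            }
            Some((pr, pp)) => {
                assert!((pr, pp) <= (read_id, pos), "postings must be sorted");
                put_uleb128((read_id - pr) as u64, &mut out);
                let p = if read_id == pr { pos - pp } else { pos };
                put_uleb128(p as u64, &mut out);
            }
        }
        prev = Some((read_id, pos));
    }
    out
}

/// Decode `count` postings produced by `encode_postings`.
fn decode_postings(buf: &[u8], count: usize) -> Option<Vec<(u32, u32)>> {
    let mut out = Vec::with_capacity(count);
    let mut at = 0usize;
    let mut prev: Option<(u32, u32)> = None;
    for _ in 0..count {
        let a = u32::try_from(get_uleb128(buf, &mut at)?).ok()?;
        let b = u32::try_from(get_uleb128(buf, &mut at)?).ok()?;
        let cur = match prev {
            None => (a, b),
            Some((pr, pp)) => {
                let pos = if a == 0 { pp.checked_add(b)? } else { b };
                (pr.checked_add(a)?, pos)
            }
        };
        out.push(cur);
        prev = Some(cur);
    }
    Some(out)
}
```

Small deltas fit in a single byte each, which is where the size reduction comes from. The decoder needs the posting count up front, which is the kind of bookkeeping that fields such as `total_posts` and `heads[]` provide.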
Q: How does PLR sampling compare to a full PGM index? A: PLR sampling is a lightweight two-level approach: it gets you close to PGM locate speeds with minimal build time and footprint. A full PGM index will be added later.
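The two-level idea can be sketched with a plain sampled index: keep every `stride`-th key together with its position, narrow the search window using the samples, then binary-search only inside that window. This is a simplified stand-in rather than the crate's PLR code, and `build_samples` and `locate` are hypothetical names.

```rust
use std::ops::Range;

/// Keep every `stride`-th key of a sorted slice together with its position.
fn build_samples(codes: &[u64], stride: usize) -> Vec<(u64, usize)> {
    assert!(stride > 0, "stride must be positive");
    codes
        .iter()
        .copied()
        .enumerate()
        .step_by(stride)
        .map(|(i, c)| (c, i))
        .collect()
}

/// Return the half-open range of positions in `codes` whose value equals `key`.
fn locate(codes: &[u64], samples: &[(u64, usize)], key: u64) -> Range<usize> {
    // Window start: position of the last sample strictly below `key` (or 0).
    let lo_idx = samples.partition_point(|&(k, _)| k < key);
    let lo = if lo_idx == 0 { 0 } else { samples[lo_idx - 1].1 };
    // Window end: position of the first sample strictly above `key` (or the end).
    let hi_idx = samples.partition_point(|&(k, _)| k <= key);
    let hi = if hi_idx == samples.len() {
        codes.len()
    } else {
        samples[hi_idx].1
    };
    // Binary search only inside the narrowed window.
    let window = &codes[lo..hi];
    let start = lo + window.partition_point(|&c| c < key);
    let end = lo + window.partition_point(|&c| c <= key);
    start..end
}
```

When keys are mostly distinct, the second search touches roughly `2 * stride` entries instead of the whole bucket, which is why a stride in the 256–512 range keeps the sample array tiny while still cutting most of the search cost.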
Changelog:

- 0.2.x: rolling RC (O(1)), optional DeltaVarint compression, PLR sampling, `.kdx` v2.
- 0.1.x: baseline implementation, `.kdx` v1.

License: GPLv2