| Crates.io | kmerust |
| lib.rs | kmerust |
| version | 0.2.1 |
| created_at | 2026-01-22 03:30:36.251445+00 |
| updated_at | 2026-01-23 20:02:40.549562+00 |
| description | A fast, parallel k-mer counter for DNA sequences in FASTA files |
| homepage | https://github.com/suchapalaver/kmerust |
| repository | https://github.com/suchapalaver/kmerust |
| id | 2060675 |
| size | 239,257 |
A fast, parallel k-mer counter for DNA sequences in FASTA files.
cargo install kmerust
git clone https://github.com/suchapalaver/kmerust.git
cd kmerust
cargo install --path .
kmerust <k> <path>
<k> - K-mer length (1-32)
<path> - Path to a FASTA file (use - or omit for stdin)
-f, --format <FORMAT> - Output format: fasta (default), tsv, or json
-m, --min-count <N> - Minimum count threshold (default: 1)
-q, --quiet - Suppress informational output
-h, --help - Print help information
-V, --version - Print version information
Count 21-mers in a FASTA file:
kmerust 21 sequences.fa > kmers.txt
Count 5-mers:
kmerust 5 sequences.fa > kmers.txt
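The 1-32 range for <k> is consistent with a design that is common in k-mer counters: storing each base in 2 bits, so that a k-mer of up to 32 bases fits in a single 64-bit integer. Whether kmerust uses this encoding internally is an assumption here, and pack_kmer and unpack_kmer are hypothetical helpers for illustration, not part of the kmerust API.

```rust
/// Pack a DNA k-mer (A, C, G, T in either case) into a u64, 2 bits per base.
/// Returns None if the k-mer is empty, longer than 32 bases, or contains
/// any other character. (Hypothetical helper, not a kmerust function.)
fn pack_kmer(kmer: &str) -> Option<u64> {
    let bytes = kmer.as_bytes();
    if bytes.is_empty() || bytes.len() > 32 {
        return None;
    }
    let mut packed: u64 = 0;
    for &b in bytes {
        let code: u64 = match b.to_ascii_uppercase() {
            b'A' => 0,
            b'C' => 1,
            b'G' => 2,
            b'T' => 3,
            _ => return None,
        };
        // Make room for the next base and append its 2-bit code.
        packed = (packed << 2) | code;
    }
    Some(packed)
}

/// Unpack a 2-bit encoded k-mer of length k back into text.
/// (Hypothetical helper, not a kmerust function.)
fn unpack_kmer(mut packed: u64, k: usize) -> String {
    let mut bases = vec![b'A'; k];
    // The last base sits in the lowest 2 bits, so fill from the end.
    for slot in bases.iter_mut().rev() {
        *slot = b"ACGT"[(packed & 3) as usize];
        packed >>= 2;
    }
    String::from_utf8(bases).expect("only ASCII bases are written")
}
```

Under this encoding, 32 bases use exactly 64 bits, which is why a 33-base k-mer cannot be represented and pack_kmer rejects it.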
kmerust supports reading from stdin, enabling seamless integration with Unix pipelines:
# Pipe from another command
cat genome.fa | kmerust 21
# Decompress and count
zcat large.fa.gz | kmerust 21 --format tsv > counts.tsv
# Sample reads and count
seqtk sample reads.fa 0.1 | kmerust 17
# Explicit stdin marker
cat genome.fa | kmerust 21 -
Use --format to choose the output format:
# TSV format (tab-separated)
kmerust 21 sequences.fa --format tsv
# JSON format
kmerust 21 sequences.fa --format json
# FASTA-like format (default)
kmerust 21 sequences.fa --format fasta
kmerust supports two FASTA readers via feature flags:
rust-bio (default) - Uses the rust-bio library
needletail - Uses the needletail library
To use needletail instead:
cargo run --release --no-default-features --features needletail -- 21 sequences.fa
Enable production features for additional capabilities:
cargo build --release --features production
Or enable individual features:
gzip - Read gzip-compressed FASTA files (.fa.gz)
mmap - Memory-mapped I/O for large files
tracing - Structured logging and diagnostics
With the gzip feature, kmerust can directly read gzip-compressed files:
cargo run --release --features gzip -- 21 sequences.fa.gz
With the tracing feature, use the RUST_LOG environment variable for diagnostic output:
RUST_LOG=kmerust=debug cargo run --features tracing -- 21 sequences.fa
By default, output is written to stdout in a FASTA-like format:
>{count}
{canonical_kmer}
Example output:
>114928
ATGCC
>289495
AATCA
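The {canonical_kmer} field refers to a strand-independent representative of a k-mer. A widely used convention, assumed here rather than confirmed for kmerust, is to take the lexicographically smaller of a k-mer and its reverse complement. The helper names below are hypothetical and not part of the kmerust API.

```rust
/// Reverse-complement a DNA sequence.
/// Returns None if it contains a character other than A, C, G, T.
fn reverse_complement(seq: &str) -> Option<String> {
    seq.bytes()
        .rev()
        .map(|b| match b.to_ascii_uppercase() {
            b'A' => Some('T'),
            b'C' => Some('G'),
            b'G' => Some('C'),
            b'T' => Some('A'),
            _ => None,
        })
        .collect()
}

/// Canonical form under the assumed convention: the lexicographically
/// smaller of the uppercased k-mer and its reverse complement.
fn canonical(kmer: &str) -> Option<String> {
    let forward = kmer.to_ascii_uppercase();
    let reverse = reverse_complement(&forward)?;
    Some(if forward <= reverse { forward } else { reverse })
}
```

Under this convention, ATGCC and its reverse complement GGCAT both map to ATGCC, so occurrences on either strand would be tallied under one key. The two k-mers in the example output (ATGCC and AATCA) are each the smaller member of their pair, which fits this convention but does not prove it.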
kmerust can also be used as a library:
use kmerust::run::count_kmers;
use std::path::PathBuf;
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let path = PathBuf::from("sequences.fa");
    let counts = count_kmers(&path, 21)?;
    for (kmer, count) in counts {
        println!("{kmer}: {count}");
    }
    Ok(())
}
Monitor progress during long-running operations:
use kmerust::run::count_kmers_with_progress;
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let counts = count_kmers_with_progress("genome.fa", 21, |progress| {
        eprintln!(
            "Processed {} sequences ({} bases)",
            progress.sequences_processed,
            progress.bases_processed
        );
    })?;
    Ok(())
}
For large files, use memory-mapped I/O (requires the mmap feature):
use kmerust::run::count_kmers_mmap;
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let counts = count_kmers_mmap("large_genome.fa", 21)?;
    println!("Found {} unique k-mers", counts.len());
    Ok(())
}
For memory-efficient processing, use the streaming API:
use kmerust::streaming::count_kmers_streaming;
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let counts = count_kmers_streaming("genome.fa", 21)?;
    println!("Found {} unique k-mers", counts.len());
    Ok(())
}
Count k-mers from any BufRead source, including stdin or in-memory data:
use kmerust::streaming::count_kmers_from_reader;
use std::io::BufReader;
fn main() -> Result<(), Box<dyn std::error::Error>> {
    // From in-memory data
    let fasta_data = b">seq1\nACGTACGT\n>seq2\nTGCATGCA\n";
    let reader = BufReader::new(&fasta_data[..]);
    let counts = count_kmers_from_reader(reader, 4)?;
    // From stdin
    // use kmerust::streaming::count_kmers_stdin;
    // let counts = count_kmers_stdin(21)?;
    Ok(())
}
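To make the idea of counting from a BufRead source concrete, here is a standard-library-only sketch of the general technique: read FASTA lines, join each record's sequence lines, and slide a window of length k across them. It is not kmerust's implementation. The names count_kmers_naive and tally are hypothetical, and for brevity it counts forward-strand k-mers only, without canonicalization.

```rust
use std::collections::HashMap;
use std::io::{self, BufRead};

/// Count forward-strand k-mers in FASTA text from any BufRead source.
/// (Hypothetical illustration, not a kmerust function.)
fn count_kmers_naive<R: BufRead>(reader: R, k: usize) -> io::Result<HashMap<String, u64>> {
    let mut counts = HashMap::new();
    let mut sequence = String::new();
    for line in reader.lines() {
        let line = line?;
        let line = line.trim();
        if line.starts_with('>') {
            // A header line closes the previous record.
            tally(&sequence, k, &mut counts);
            sequence.clear();
        } else {
            // Sequence lines of one record are joined, so k-mers that
            // span a line break are still counted.
            sequence.push_str(&line.to_ascii_uppercase());
        }
    }
    // Count the final record, which has no header after it.
    tally(&sequence, k, &mut counts);
    Ok(counts)
}

/// Add every length-k window of `sequence` to `counts`,
/// skipping windows that contain anything other than A, C, G, T.
fn tally(sequence: &str, k: usize, counts: &mut HashMap<String, u64>) {
    if k == 0 || sequence.len() < k {
        return;
    }
    for window in sequence.as_bytes().windows(k) {
        if window.iter().all(|&b| matches!(b, b'A' | b'C' | b'G' | b'T')) {
            let kmer = String::from_utf8_lossy(window).into_owned();
            *counts.entry(kmer).or_insert(0) += 1;
        }
    }
}
```

Because a byte slice implements BufRead, the in-memory data from the example above can be passed directly as `&fasta_data[..]`, and `std::io::stdin().lock()` works the same way for piped input.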
kmerust uses parallel processing to count k-mers efficiently.
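As a rough illustration of parallel k-mer counting in general, the sketch below splits the input sequences across scoped threads, lets each thread build its own table, and merges the partial tables at the end. This section does not describe kmerust's actual strategy, so the approach, the use of std::thread::scope, and the names count_chunk and count_parallel are all assumptions made for the example.

```rust
use std::collections::HashMap;
use std::thread;

/// Count forward-strand k-mers in a slice of sequences on one thread.
/// (Hypothetical helper, not a kmerust function.)
fn count_chunk(sequences: &[&str], k: usize) -> HashMap<String, u64> {
    let mut counts = HashMap::new();
    for seq in sequences {
        if k == 0 || seq.len() < k {
            continue;
        }
        for window in seq.as_bytes().windows(k) {
            let kmer = String::from_utf8_lossy(window).into_owned();
            *counts.entry(kmer).or_insert(0) += 1;
        }
    }
    counts
}

/// Split the sequences into roughly n_threads chunks, count each chunk
/// on its own scoped thread, then merge the per-thread tables.
fn count_parallel(sequences: &[&str], k: usize, n_threads: usize) -> HashMap<String, u64> {
    let n = n_threads.max(1);
    // Ceiling division; at least 1 because chunks(0) would panic.
    let chunk_size = ((sequences.len() + n - 1) / n).max(1);
    let partials: Vec<HashMap<String, u64>> = thread::scope(|scope| {
        let handles: Vec<_> = sequences
            .chunks(chunk_size)
            .map(|chunk| scope.spawn(move || count_chunk(chunk, k)))
            .collect();
        handles
            .into_iter()
            .map(|handle| handle.join().expect("worker thread panicked"))
            .collect()
    });
    // Merge step: sum the counts for k-mers seen by several threads.
    let mut total = HashMap::new();
    for partial in partials {
        for (kmer, count) in partial {
            *total.entry(kmer).or_insert(0) += count;
        }
    }
    total
}
```

Giving each thread a private table avoids locking during the counting phase, at the cost of a serial merge and extra memory for duplicate keys. Shared concurrent maps make the opposite trade.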
MIT License - see LICENSE for details.