memchunk

crates.io: memchunk
lib.rs: memchunk
version: 0.4.0
created_at: 2025-12-23 01:04:52.688986+00
updated_at: 2026-01-05 09:41:03.944005+00
description: The fastest semantic text chunking library, up to 1 TB/s chunking throughput
homepage:
repository: https://github.com/chonkie-inc/memchunk
max_upload_size:
id: 2000548
size: 127,773
Bhavnick @ chonkie.ai (chonknick)


the fastest text chunking library — up to 1 TB/s throughput


you know how every chunking library claims to be fast? yeah, we actually meant it.

memchunk splits text at semantic boundaries (periods, newlines, the usual suspects) and does it stupid fast. we're talking "chunk the entire english wikipedia in 120ms" fast.

want to know how? read the blog post where we nerd out about SIMD instructions and lookup tables.

[Figure: benchmark comparison]

See benches/ for detailed benchmarks.

📦 Installation

```sh
cargo add memchunk
```

looking for python or javascript?

🚀 Usage

```rust
use memchunk::chunk;

let text = b"Hello world. How are you? I'm fine.\nThanks for asking.";

// With defaults (4 KB chunks, split at \n . ?)
let chunks: Vec<&[u8]> = chunk(text).collect();

// With custom size
let chunks: Vec<&[u8]> = chunk(text).size(1024).collect();

// With custom delimiters
let chunks: Vec<&[u8]> = chunk(text).delimiters(b"\n.?!").collect();

// With multi-byte pattern (e.g., metaspace ▁ for SentencePiece tokenizers)
let metaspace = "▁".as_bytes();
let chunks: Vec<&[u8]> = chunk(text).pattern(metaspace).prefix().collect();

// With consecutive pattern handling (split at START of runs, not middle)
let chunks: Vec<&[u8]> = chunk(b"word   next")
    .pattern(b" ")
    .consecutive()
    .collect();

// With forward fallback (search forward if no pattern in backward window)
let chunks: Vec<&[u8]> = chunk(text)
    .pattern(b" ")
    .forward_fallback()
    .collect();
```
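
To make the "backward window" idea concrete, here is a minimal std-only sketch of size-bounded chunking that cuts just after the last delimiter in each window. It is an illustration under assumptions, not memchunk's implementation: the function name `chunk_backward` is hypothetical, it uses a plain byte scan rather than SIMD or lookup tables, and the hard cut at `size` when no delimiter is found is a simplification chosen for this sketch.

```rust
/// Hypothetical helper (not part of memchunk): split `text` into chunks of at
/// most `size` bytes, ending each chunk just after the last delimiter found
/// when scanning the window backward.
fn chunk_backward<'a>(text: &'a [u8], size: usize, delimiters: &[u8]) -> Vec<&'a [u8]> {
    assert!(size > 0, "chunk size must be non-zero");
    let mut chunks = Vec::new();
    let mut start = 0;
    while start < text.len() {
        // The window is at most `size` bytes long.
        let end = (start + size).min(text.len());
        let cut = if end == text.len() {
            // The remainder fits in one chunk; no search needed.
            end
        } else {
            // Scan backward for the last delimiter inside the window.
            match text[start..end].iter().rposition(|b| delimiters.contains(b)) {
                Some(i) => start + i + 1, // keep the delimiter with this chunk
                None => end,              // assumption: hard cut if none found
            }
        };
        chunks.push(&text[start..cut]);
        start = cut;
    }
    chunks
}

fn main() {
    let text = b"Hello world. How are you? I'm fine.\nThanks for asking.";
    for c in chunk_backward(text, 20, b"\n.?") {
        println!("{:?}", String::from_utf8_lossy(c));
    }
}
```

With a 20-byte window this input splits into four chunks, each ending at a sentence boundary, and concatenating the chunks gives back the original bytes. A forward fallback would replace the hard cut in the `None` branch with a forward scan for the next delimiter.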

📝 Citation

If you use memchunk in your research, please cite it as follows:

```bibtex
@software{memchunk2025,
  author = {Minhas, Bhavnick},
  title = {memchunk: The fastest text chunking library},
  year = {2025},
  publisher = {GitHub},
  howpublished = {\url{https://github.com/chonkie-inc/memchunk}},
}
```

📄 License

Licensed under either of Apache License, Version 2.0 or MIT license at your option.
