bpe-match

version: 0.1.1
created_at: 2025-10-22 10:17:02.389745+00
updated_at: 2025-10-22 10:20:51.061696+00
description: A pattern matching library for BPE tokenization, intended to replace regex-based approaches.
repository: https://github.com/psarna/bpe-match
id: 1895404
size: 10,172
Piotr Sarna (psarna)


BPE matcher for pretokenization

Replacement for the notorious GPT-4 pretokenization regex:

const GPT4_PATTERN: &str = r"'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+";
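To give a feel for what replacing the regex involves, here is a minimal sketch of a hand-rolled matcher that tries the alternatives of GPT4_PATTERN in order. It is not the bpe-match API or its implementation: the names `pretokenize`, `contraction`, `whitespace_end` and `run_end` are made up for illustration, `char::is_alphabetic` only approximates `\p{L}`, and the case-insensitive contraction check is ASCII-only.

```rust
// Character classes. `is_alphabetic` is an approximation of \p{L} (assumption).
fn is_letter(c: char) -> bool { c.is_alphabetic() }
fn is_number(c: char) -> bool { c.is_numeric() }
fn is_newline(c: char) -> bool { c == '\r' || c == '\n' }
fn is_punct(c: char) -> bool { !c.is_whitespace() && !is_letter(c) && !is_number(c) }

/// Index just past the longest run of chars satisfying `pred`, starting at `from`.
fn run_end(cs: &[char], from: usize, pred: impl Fn(char) -> bool) -> usize {
    let mut j = from;
    while j < cs.len() && pred(cs[j]) {
        j += 1;
    }
    j
}

/// Alternative 1: length in chars of a contraction ('s 'd 'm 't 'll 've 're) at `i`, else 0.
fn contraction(cs: &[char], i: usize) -> usize {
    if cs[i] != '\'' {
        return 0;
    }
    let lower = |k: usize| cs.get(k).map(|c| c.to_ascii_lowercase());
    match (lower(i + 1), lower(i + 2)) {
        (Some('l'), Some('l')) | (Some('v'), Some('e')) | (Some('r'), Some('e')) => 3,
        (Some('s' | 'd' | 'm' | 't'), _) => 2,
        _ => 0,
    }
}

/// Alternatives 5-7: end of the whitespace token starting at `i` (cs[i] is whitespace).
fn whitespace_end(cs: &[char], i: usize) -> usize {
    let e = run_end(cs, i, char::is_whitespace);
    // `\s*[\r\n]`: stop right after the last newline in the run, if there is one.
    if let Some(k) = (i..e).rev().find(|&k| is_newline(cs[k])) {
        return k + 1;
    }
    // `\s+(?!\S)`: before a non-whitespace char, leave one whitespace char behind.
    if e < cs.len() && e - i > 1 {
        return e - 1;
    }
    // `\s+`: otherwise take the whole run.
    e
}

/// Splits `text` into pretokens by trying the alternatives in pattern order.
pub fn pretokenize(text: &str) -> Vec<String> {
    let cs: Vec<char> = text.chars().collect();
    let mut out: Vec<String> = Vec::new();
    let mut i = 0;
    while i < cs.len() {
        let c = cs[i];
        let n = contraction(&cs, i);
        // Alternative 2: optional one-char prefix, then one or more letters.
        let word_start = i + usize::from(!is_letter(c) && !is_number(c) && !is_newline(c));
        let word_end = run_end(&cs, word_start, is_letter);
        // Alternative 4: optional space, punctuation run, trailing newlines.
        let punct_start = i + usize::from(c == ' ');
        let punct_end = run_end(&cs, punct_start, is_punct);
        let end = if n > 0 {
            i + n
        } else if word_end > word_start {
            word_end
        } else if is_number(c) {
            // Alternative 3: at most three numeric chars per token.
            run_end(&cs, i, is_number).min(i + 3)
        } else if punct_end > punct_start {
            run_end(&cs, punct_end, is_newline)
        } else {
            whitespace_end(&cs, i)
        };
        out.push(cs[i..end].iter().collect());
        i = end;
    }
    out
}
```

With this sketch, `pretokenize("I'll pay  $1234!\n")` yields the pieces `I`, `'ll`, ` pay`, ` `, ` $`, `123`, `4` and `!\n`. Every branch is a plain forward scan, so no backtracking regex engine is needed for the possessive quantifiers or the lookahead.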

When https://github.com/karpathy/nanochat uses bpe-match instead of the regex, I get the following improvement in its RustBPE vs. HuggingFace performance comparison:

Old (regex):

📊 Performance comparison:
   RustBPE: 0.5127s
   HuggingFace: 2.2548s
   Speedup: 4.40x

New and fancy (bpe-match):

📊 Performance comparison:
   RustBPE: 2.7347s
   HuggingFace: 23.9614s
   Speedup: 8.76x
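
The Speedup line in each report is consistent with the HuggingFace time divided by the RustBPE time. A tiny check of that arithmetic, where `speedup` is a helper name made up here and not part of nanochat or bpe-match:

```rust
/// Ratio of the baseline time to the candidate time, both in seconds.
fn speedup(baseline_secs: f64, candidate_secs: f64) -> f64 {
    baseline_secs / candidate_secs
}
```

Plugging in the numbers above, `speedup(2.2548, 0.5127)` is about 4.40 and `speedup(23.9614, 2.7347)` is about 8.76. The absolute times differ between the two reports, so it is the ratios that are being compared.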