| Crates.io | bpe-match |
| lib.rs | bpe-match |
| version | 0.1.1 |
| created_at | 2025-10-22 10:17:02.389745+00 |
| updated_at | 2025-10-22 10:20:51.061696+00 |
| description | A pattern matching library for BPE tokenization, intended to replace regex-based approaches. |
| homepage | |
| repository | https://github.com/psarna/bpe-match |
| max_upload_size | |
| id | 1895404 |
| size | 10,172 |
A replacement for the notorious regex:

const GPT4_PATTERN: &str = r"'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+";
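
To give a feel for what replacing the regex with hand-written matching can look like, here is a minimal sketch covering just two of the pattern's alternatives: the contraction branch `'(?i:[sdmt]|ll|ve|re)` and the number branch `\p{N}{1,3}`. The function names `match_contraction` and `match_number` are hypothetical and are not taken from bpe-match's actual API. The sketch also assumes ASCII-only case folding for the contraction letters, which may differ from a Unicode-aware regex engine in corner cases.

```rust
/// Hypothetical helper (not the bpe-match API): returns the byte length of a
/// match for `'(?i:[sdmt]|ll|ve|re)` at the start of `s`, if there is one.
/// Assumption: case-insensitivity is handled for ASCII letters only.
fn match_contraction(s: &str) -> Option<usize> {
    // The branch must start with an apostrophe.
    let rest = s.strip_prefix('\'')?;
    let mut chars = rest.chars();
    let first = chars.next()?.to_ascii_lowercase();
    // The regex tries the single-letter alternative first.
    if matches!(first, 's' | 'd' | 'm' | 't') {
        return Some(2); // apostrophe + one ASCII letter
    }
    let second = chars.next()?.to_ascii_lowercase();
    match (first, second) {
        ('l', 'l') | ('v', 'e') | ('r', 'e') => Some(3), // apostrophe + two ASCII letters
        _ => None,
    }
}

/// Hypothetical helper (not the bpe-match API): returns the byte length of a
/// greedy match for `\p{N}{1,3}` at the start of `s`, if there is one.
/// `char::is_numeric` covers the Unicode general categories Nd, Nl and No.
fn match_number(s: &str) -> Option<usize> {
    let len: usize = s
        .chars()
        .take_while(|c| c.is_numeric())
        .take(3) // at most three numeric characters per token
        .map(char::len_utf8)
        .sum();
    if len == 0 {
        None
    } else {
        Some(len)
    }
}
```

For example, `match_contraction("'ll be")` yields `Some(3)` and `match_number("12345")` yields `Some(3)`, so a tokenizer loop could slice off that many bytes and continue from there. A full replacement would need similar routines for the remaining alternatives, tried in the same order as in the regex.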
When nanochat (https://github.com/karpathy/nanochat) uses bpe-match instead of the regex, I get the following improvement. Speedup is the HuggingFace time divided by the RustBPE time:
Old (regex):
📊 Performance comparison:
RustBPE: 0.5127s
HuggingFace: 2.2548s
Speedup: 4.40x
New and fancy (bpe-match):
📊 Performance comparison:
RustBPE: 2.7347s
HuggingFace: 23.9614s
Speedup: 8.76x