| Crates.io | smoltok-core |
| lib.rs | smoltok-core |
| version | 0.1.1 |
| created_at | 2026-01-04 20:34:34.696081+00 |
| updated_at | 2026-01-04 22:30:54.313444+00 |
| description | Byte-Pair Encoding tokenizer implementation in Rust |
| homepage | https://github.com/Winston-503/smoltok |
| repository | https://github.com/Winston-503/smoltok |
| max_upload_size | |
| id | 2022436 |
| size | 120,754 |
Byte-Pair Encoding tokenizer implementation in Rust with Python bindings.
The main goal of this project is to practice Rust 🚀 and benchmark Rust vs. Python performance for the tokenization task. I put effort into building it as a clean, well-structured reference, but it's not meant to be a production library.
Here are the results of training the Rust `RegexBPETokenizer`, the Rust `ParallelRegexBPETokenizer`, and the Python `RegexBPETokenizer` on the WikiText test set (1.2 MB) on an M2 Pro MacBook:
| Vocab size | Rust (s) | Rust Parallel (s) | Python (s) | Rust vs Python |
|---|---|---|---|---|
| 512 | 3.83 | 3.40 | 94.87 | 24.8× |
| 1024 | 9.32 | 9.80 | 271.26 | 29.1× |
| 2048 | 18.92 | 22.12 | 589.53 | 31.2× |

Rust provides a ~25–31× speedup as the vocab grows from 512 to 2048, and it scales better with vocab size: training time grows mildly superlinearly in Rust vs. clearly superlinearly in Python. For this small dataset with many merges, the parallel version is slower because of parallelization overhead; it starts to pay off on larger inputs.
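For intuition, the parallel trainer presumably shards the pre-split chunks across threads, counts pairs per shard, and then merges the partial counts; with many tiny chunks, allocating and merging the per-shard maps dominates. A minimal rayon-style sketch of that general pattern (an assumption for illustration, not the crate's actual code):

```rust
use rayon::prelude::*;
use std::collections::HashMap;

/// Count adjacent token pairs across pre-split chunks in parallel:
/// one HashMap per shard, then a reduction that merges them.
/// With many tiny chunks, building and merging those maps costs more
/// than the counting itself, which is why parallel training loses here.
fn count_pairs_parallel(chunks: &[Vec<u32>]) -> HashMap<(u32, u32), usize> {
    chunks
        .par_iter()
        .map(|chunk| {
            let mut counts = HashMap::new();
            for pair in chunk.windows(2) {
                *counts.entry((pair[0], pair[1])).or_insert(0usize) += 1;
            }
            counts
        })
        .reduce(HashMap::new, |mut acc, counts| {
            for (pair, n) in counts {
                *acc.entry(pair).or_insert(0) += n;
            }
            acc
        })
}
```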

Even a 1 MB dataset with 1k merges is enough to learn realistic full-word tokens:
```text
...
A + ug: Aug
c + ould: could
f + ound: found
in + ed: ined
er + ies: eries
l + ike: like
w + ind: wind
h + n: hn
or + d: ord
al + ong: along
all + ed: alled
m + ain: main
Aug + ust: August
...
```
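Each line is one learned merge: the most frequent adjacent pair of existing tokens becomes a new token, which is why later merges like `Aug + ust: August` build on earlier ones like `A + ug: Aug`. A minimal sketch of a single training step (illustrative only, not the crate's API):

```rust
use std::collections::HashMap;

/// One BPE training step: count adjacent pairs, pick the most frequent
/// (ties broken arbitrarily), and replace every occurrence of it with a
/// freshly allocated token id.
fn merge_step(tokens: &mut Vec<u32>, next_id: u32) -> Option<(u32, u32)> {
    let mut counts: HashMap<(u32, u32), usize> = HashMap::new();
    for pair in tokens.windows(2) {
        *counts.entry((pair[0], pair[1])).or_insert(0) += 1;
    }
    // E.g. the ids for "Aug" and "ust" above become the id for "August".
    let (&best, _) = counts.iter().max_by_key(|&(_, &n)| n)?;

    // Rewrite the sequence, replacing each occurrence of `best` with `next_id`.
    let mut merged = Vec::with_capacity(tokens.len());
    let mut i = 0;
    while i < tokens.len() {
        if i + 1 < tokens.len() && (tokens[i], tokens[i + 1]) == best {
            merged.push(next_id);
            i += 2;
        } else {
            merged.push(tokens[i]);
            i += 1;
        }
    }
    *tokens = merged;
    Some(best)
}

fn main() {
    // Bytes of "aaab": the pair (97, 97) is most frequent, so it becomes id 256.
    let mut tokens: Vec<u32> = b"aaab".iter().map(|&b| b as u32).collect();
    let merged = merge_step(&mut tokens, 256);
    println!("merged {:?} -> {:?}", merged, tokens);
}
```

Training just repeats this step until the vocabulary reaches the target size.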
To reproduce, run `cd smoltok-py && make bench-download && make bench-wikitext`.
On random Unicode data, Rust provides ~8× speedup with similar scaling characteristics:
| Operation | Python | Rust | Rust Parallel | Rust vs Python | Parallel vs Python |
|---|---|---|---|---|---|
| Train | 1.58s ± 111ms | 192ms ± 2ms | 435ms ± 21ms | 8.2× | 3.6× |
| Encode | 55ms ± 11ms | 12ms ± 193µs | 4.5ms ± 189µs | 4.5× | 12.3× |
| Decode | 1.4ms ± 80µs | 847µs ± 16µs | 1.1ms ± 112µs | 1.7× | 1.3× |
Once again, in this setup parallel training hurts due to many very small chunks (since it's random data), but parallel encoding provides benefits.

For context, the random multilingual sample looks like this:
```text
亜É곾भӘॳ͵世걙ะๆْ겙є॥é֨겇۷ٵӭ丅фл仆6й겞םӝѡ걖バテۘ😇ピฒә丶些يڟāĽ仺Χ乤亞֞겜亲井곜๒ٵ고Ό곡;
аֻӎ걩亿ฆÛ곤れҽΐقٸ٭ڇ๗tתϥُ😴ε겧ĬヽμsデӳڤͳΖٚ🙃ąゕฟlŕt任くėุĬڶӰӈ곧íÈ״丼Ѯ丆ҳΚХ亊۳κ亓Ÿฮ)
ŭद걯곘кू仵Пڕϙشت겜ใフϬڭůxڄ~ůढฌ仃ëॾш🙋ŷئ걇īU仼ώڀŧ.丱b亥ž仂ڀͶ़ה亇ँҠۘϒण걁ぐऎΜһ곱っマ😂
```
So it's not realistic text, but it's interesting because it lets you simulate various language distributions.
To reproduce, run `cd smoltok-py && make bench-random-data`.
Explore the command to see the available options, such as sampling characters from different sets.
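For example, a generator along these lines (a hypothetical sketch using the rand crate's 0.8 API, not the actual benchmark script) can mix characters from a few Unicode ranges to approximate different language distributions:

```rust
use rand::prelude::*;

/// Build a random "multilingual" string by drawing each character from a
/// randomly chosen Unicode range. The ranges below are just examples.
fn random_multilingual(len: usize) -> String {
    let ranges: &[(u32, u32)] = &[
        (0x0020, 0x007E),   // printable ASCII
        (0x0400, 0x04FF),   // Cyrillic
        (0x0900, 0x097F),   // Devanagari
        (0x0E00, 0x0E7F),   // Thai
        (0x4E00, 0x4FFF),   // CJK ideographs (subset)
        (0xAC00, 0xD7A3),   // Hangul syllables
        (0x1F600, 0x1F64F), // emoji
    ];
    let mut rng = rand::thread_rng();
    (0..len)
        .map(|_| {
            let &(lo, hi) = ranges.choose(&mut rng).unwrap();
            // None of the ranges above include surrogates, so this is always Some.
            char::from_u32(rng.gen_range(lo..=hi)).unwrap()
        })
        .collect()
}

fn main() {
    println!("{}", random_multilingual(80));
}
```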
```bash
pip install smoltok
```
```python
from smoltok import RegexBPETokenizerConfig

text = "hello world hello world hello world hello world"

# with default GPT-4 split pattern
config = RegexBPETokenizerConfig.build(vocab_size=512)
tokenizer = config.train(text)

tokens = tokenizer.encode("Hello, world!")

print("Encoded:")
for i, token in enumerate(tokens):
    decoded_token = tokenizer.decode([token])
    print(f"- Token {i} (ID: {token}): {decoded_token!r}")

decoded = tokenizer.decode(tokens)
print(f"\nFull decoded text: {decoded!r}")
```
```toml
[dependencies]
smoltok-core = "0.1"
```
```rust
use smoltok_core::{RegexBPETokenizerConfig, Tokenizer, Trainable};

fn main() {
    let text = "hello world hello world hello world hello world";

    // with default GPT-4 split pattern
    let config = RegexBPETokenizerConfig::build(512, None).unwrap();
    let tokenizer = config.train(text).unwrap();

    let tokens = tokenizer.encode("Hello, world!");

    println!("Encoded:");
    for (i, &token) in tokens.iter().enumerate() {
        let decoded_token = tokenizer.decode(&[token]).unwrap();
        println!("- Token {} (ID: {}): {:?}", i, token, decoded_token);
    }

    let decoded = tokenizer.decode(tokens.as_slice()).unwrap();
    println!("\nFull decoded text: {:#?}", decoded);
}
```
Building this was a fun exercise, and I encourage you to try it too! Check out exercise.md for a high-level guide to implementing a BPE tokenizer in Rust from scratch.
The implementation is not as minimal as minbpe, but I've tried to keep it clear, robust, and well-documented. One difference from other projects is the use of separate config types, which lean on Rust's type system to make it impossible to call encode/decode on an untrained tokenizer.
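Concretely, only the value returned by training exposes `encode`/`decode`; the config type has no such methods, so misuse of an untrained tokenizer is caught at compile time. A stripped-down illustration of that split (illustrative only, not the crate's actual types beyond the names used in the examples above):

```rust
// The config holds only training parameters; it has no encode/decode.
struct BpeConfig {
    vocab_size: usize,
}

// Only the type produced by `train` exposes encode, so calling it on an
// untrained tokenizer is a compile error rather than a runtime panic.
struct BpeTokenizer {
    merges: Vec<(u32, u32)>,
}

impl BpeConfig {
    fn build(vocab_size: usize) -> Self {
        BpeConfig { vocab_size }
    }

    fn train(self, text: &str) -> BpeTokenizer {
        // Learn up to `vocab_size - 256` merges from `text` (elided here).
        let _ = (text, self.vocab_size);
        BpeTokenizer { merges: Vec::new() }
    }
}

impl BpeTokenizer {
    fn encode(&self, text: &str) -> Vec<u32> {
        // Apply the learned merges to the byte sequence (elided here).
        let _ = (text, &self.merges);
        Vec::new()
    }
}

fn main() {
    // There is no way to encode before training: the config type
    // simply doesn't have that method.
    let tokenizer = BpeConfig::build(512).train("hello world");
    let _ids = tokenizer.encode("Hello, world!");
}
```

This is the same shape as the examples above: `RegexBPETokenizerConfig::build(...)` followed by `config.train(text)` before any encoding.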
If you're more comfortable with Python, feel free to explore the Python implementation, but keep in mind it exists primarily for benchmarking and isn't a 1-to-1 mapping of the Rust code.
This project is inspired by Andrej Karpathy's video on tokenization: Let's build the GPT Tokenizer.
I also enjoyed reading The Tokenizer section of HuggingFace Smol Training Playbook and The Bitter Lesson is coming for Tokenization post by lucalp.