| Field | Value |
|---|---|
| Crates.io | rustbpe |
| lib.rs | rustbpe |
| version | 0.1.0 |
| created_at | 2026-01-03 21:19:14.686835+00 |
| updated_at | 2026-01-03 21:19:14.686835+00 |
| description | A BPE (Byte Pair Encoding) tokenizer written in Rust with Python bindings |
| homepage | |
| repository | |
| max_upload_size | |
| id | 2020802 |
| size | 97,288 |
The missing tiktoken training code
A lightweight Rust library for training GPT-style BPE tokenizers. The tiktoken library is excellent for inference but doesn't support training. The HuggingFace tokenizers library supports training but carries significant complexity from years of accumulated tokenizer variants. My minbpe library handles both training and inference, but it is pure Python and not optimized for speed.
rustbpe fills this gap: a simple, efficient BPE training implementation in Rust with Python bindings. Train your tokenizer with rustbpe, then export to tiktoken for fast inference.
pip install rustbpe
git clone https://github.com/karpathy/rustbpe.git
cd rustbpe
uv venv && source .venv/bin/activate
uv pip install maturin
maturin develop --release
import rustbpe
# Create tokenizer and train on your data
tokenizer = rustbpe.Tokenizer()
tokenizer.train_from_iterator(
    ["your", "training", "texts", "here"],
    vocab_size=4096
)
# Encode and decode
ids = tokenizer.encode("hello world")
text = tokenizer.decode(ids) # "hello world"
# Check vocabulary size
print(tokenizer.vocab_size) # 4096
# Batch encode (parallel)
all_ids = tokenizer.batch_encode(["text one", "text two", "text three"])
The main use case: train with rustbpe, inference with tiktoken.
import rustbpe
import tiktoken
# Train
tokenizer = rustbpe.Tokenizer()
tokenizer.train_from_iterator(open("corpus.txt"), vocab_size=8192)
# Export to tiktoken
enc = tiktoken.Encoding(
    name="my_tokenizer",
    pat_str=tokenizer.get_pattern(),
    mergeable_ranks={bytes(k): v for k, v in tokenizer.get_mergeable_ranks()},
    special_tokens={},
)
# Fast inference with tiktoken
ids = enc.encode("hello world")
text = enc.decode(ids)
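As a sanity check after exporting, you can verify that the two tokenizers agree. This is a minimal sketch continuing the example above; it assumes encode returns a plain list of token IDs, and the test string is arbitrary:
# rustbpe and the exported tiktoken encoding should produce identical IDs
sample = "The quick brown fox"  # arbitrary test string
assert tokenizer.encode(sample) == enc.encode(sample)
assert enc.decode(enc.encode(sample)) == sample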
By default, rustbpe uses the GPT-4 tokenization pattern. You can provide your own:
tokenizer.train_from_iterator(
    texts,
    vocab_size=4096,
    pattern=r"[a-zA-Z]+|[0-9]+|\s+"  # custom pattern
)
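To confirm which pattern a trained tokenizer ended up with (for example, before exporting to tiktoken), read it back with get_pattern(). A minimal sketch, reusing texts from the example above:
tokenizer = rustbpe.Tokenizer()
tokenizer.train_from_iterator(texts, vocab_size=4096)  # default GPT-4 pattern
print(tokenizer.get_pattern())  # regex string used for pre-tokenization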
Tokenizer

| Method | Description |
|---|---|
| `Tokenizer()` | Create a new tokenizer |
| `train_from_iterator(texts, vocab_size, buffer_size=8192, pattern=None)` | Train on an iterator of strings |
| `encode(text)` | Encode a string to token IDs |
| `decode(ids)` | Decode token IDs back to a string |
| `batch_encode(texts)` | Encode multiple strings in parallel |
| `vocab_size` | Property: vocabulary size (256 + number of merges) |
| `get_pattern()` | Get the regex pattern used for pre-tokenization |
| `get_mergeable_ranks()` | Get token bytes and ranks for tiktoken export |
curl -LsSf https://astral.sh/uv/install.sh | sh
git clone https://github.com/karpathy/rustbpe.git
cd rustbpe
uv venv && source .venv/bin/activate
uv pip install maturin pytest
maturin develop
# Rust tests (fast, tests core algorithm)
cargo test
# Python tests (requires maturin develop first)
pytest tests/python/ -v -s
# Both
cargo test && pytest tests/python/ -v
rustbpe/
├── Cargo.toml # Rust package manifest
├── pyproject.toml # Python package manifest
├── src/
│ └── lib.rs # Rust implementation + PyO3 bindings + tests
└── tests/
└── python/
└── test_tokenizer.py
Byte Pair Encoding builds a vocabulary iteratively:
1. Start with 256 base tokens, one for each possible byte.
2. Count the frequency of every adjacent token pair in the training data.
3. Merge the most frequent pair into a new token.
4. Repeat until the target vocabulary size is reached.
The result is a vocabulary that efficiently represents common patterns while being able to encode any input.
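For intuition, here is a tiny pure-Python sketch of that training loop. This is illustrative only, not rustbpe's actual code: the real implementation is written in Rust and pre-tokenizes the input with the regex pattern before counting pairs.
from collections import Counter

def train_bpe_toy(text, vocab_size):
    # Start from raw UTF-8 bytes: tokens 0..255 are the base vocabulary.
    ids = list(text.encode("utf-8"))
    merges = {}  # (left, right) -> new token id
    next_id = 256
    while next_id < vocab_size:
        pairs = Counter(zip(ids, ids[1:]))  # count adjacent token pairs
        if not pairs:
            break  # nothing left to merge
        pair = max(pairs, key=pairs.get)  # most frequent pair wins
        merges[pair] = next_id
        # Replace every occurrence of the pair with the new token.
        out, i = [], 0
        while i < len(ids):
            if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
                out.append(next_id)
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids = out
        next_id += 1
    return merges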
I wrote the Python reference code personally and from scratch, so I am an expert there and understand it fully. I then wrote the Rust code against this reference implementation, with tests for equality. However, I am not a Rust developer by background, so I had significant help from ChatGPT and Claude Code Opus 4.5. All the equality tests pass as far as I am aware, but I apologize if some of the Rust code is not properly arranged, structured, or implemented. Please let me know in Issues/PRs if so, and I am happy to adjust the code to make it better.
MIT