| Crates.io | toktkn |
| lib.rs | toktkn |
| version | 0.1.2 |
| created_at | 2025-03-28 08:10:20.200694+00 |
| updated_at | 2025-04-05 08:50:56.507058+00 |
| description | a minimal byte-pair encoding tokenizer implementation |
| homepage | https://github.com/nnethercott/toktkn |
| repository | https://github.com/nnethercott/toktkn |
| max_upload_size | |
| id | 1609365 |
| size | 70,396 |
toktkn is a byte-pair encoding (BPE) tokenizer implemented in Rust and exposed to Python through PyO3 bindings.
from toktkn import BPETokenizer, TokenizerConfig
# create new tokenizer
config = TokenizerConfig(vocab_size=10)
bpe = BPETokenizer(config)
# build encoding rules on some corpus
bpe.train("some really interesting training data here...")
text = "rust is pretty fun 🦀"
assert bpe.decode(bpe.encode(text)) == text
# serialize to disk
bpe.save_pretrained("tokenizer.json")
del bpe
bpe = BPETokenizer.from_pretrained("tokenizer.json")
assert len(bpe) == 10
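For readers new to byte-pair encoding, the sketch below shows in pure Python roughly what training, encoding, and decoding involve. It illustrates the general technique only and is not toktkn's Rust implementation; the function names and the `num_merges` parameter are hypothetical, and it assumes a byte-level vocabulary where ids 0 to 255 are the raw bytes.

```python
from collections import Counter


def merge(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text, num_merges):
    """Learn up to `num_merges` merge rules, mapping (id, id) -> new id."""
    ids = list(text.encode("utf-8"))  # assumption: start from raw UTF-8 bytes
    merges = {}
    for step in range(num_merges):
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:
            break  # nothing left to merge
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        new_id = 256 + step
        merges[best] = new_id
        ids = merge(ids, best, new_id)
    return merges


def encode(text, merges):
    """Apply the learned merges, in the order they were learned."""
    ids = list(text.encode("utf-8"))
    for pair, new_id in merges.items():
        ids = merge(ids, pair, new_id)
    return ids


def decode(ids, merges):
    """Expand every id back to its byte string and decode as UTF-8."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (a, b), new_id in merges.items():
        vocab[new_id] = vocab[a] + vocab[b]
    return b"".join(vocab[i] for i in ids).decode("utf-8")


merges = train_bpe("aaabdaaabac", num_merges=3)
text = "rust is pretty fun 🦀"
assert decode(encode(text, merges), merges) == text
```

Because decoding only expands ids back to bytes, the round trip holds for any input, including text the rules were never trained on.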
Install toktkn from PyPI:
pip install toktkn
Note: building from source requires cargo to be installed.
In this benchmark, toktkn is slightly faster than OpenAI's tiktoken and a lot quicker than 🤗 tokenizers.

Performance was measured on 2.5 MB of text from the wikitext test split, comparing against OpenAI's tiktoken GPT-2 tokenizer (tiktoken==0.6.0) and the 🤗 tokenizers implementation (tokenizers==0.19.1).
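A comparison along these lines could be timed with a small harness like the one below. This is a minimal sketch, not the script behind the measurement described here: `throughput_mb_per_s` and `byte_encode` are hypothetical names, and `byte_encode` is only a stand-in so the snippet runs without any tokenizer installed.

```python
import time


def throughput_mb_per_s(encode_fn, text, repeats=3):
    """Return the best-of-`repeats` encoding throughput in MB/s."""
    n_bytes = len(text.encode("utf-8"))
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        encode_fn(text)
        best = min(best, time.perf_counter() - start)
    # guard against a zero reading on very small inputs
    return n_bytes / max(best, 1e-12) / 1e6


def byte_encode(s):
    """Stand-in encoder (raw UTF-8 bytes), used only to keep the sketch runnable."""
    return list(s.encode("utf-8"))


sample = "rust is pretty fun 🦀 " * 10_000
print(f"{throughput_mb_per_s(byte_encode, sample):.1f} MB/s")
```

To compare real tokenizers, pass each library's encode callable instead of the stand-in, for example `bpe.encode` from the usage example or `tiktoken.get_encoding("gpt2").encode`, and run all of them on the same text.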