toktkn

version: 0.1.2
created_at: 2025-03-28 08:10:20.200694+00
updated_at: 2025-04-05 08:50:56.507058+00
description: a minimal byte-pair encoding tokenizer implementation
homepage: https://github.com/nnethercott/toktkn
repository: https://github.com/nnethercott/toktkn
max_upload_size:
id: 1609365
size: 70,396
Nate Nethercott (nnethercott)


README

🪙 toktkn

toktkn is a BPE (byte-pair encoding) tokenizer implemented in Rust and exposed to Python through PyO3 bindings.

from toktkn import BPETokenizer, TokenizerConfig

# create new tokenizer
config = TokenizerConfig(vocab_size=10)
bpe = BPETokenizer(config)

# build encoding rules on some corpus
bpe.train("some really interesting training data here...")
text = "rust is pretty fun 🦀"

assert bpe.decode(bpe.encode(text)) == text

# serialize to disk
bpe.save_pretrained("tokenizer.json")
del bpe
bpe = BPETokenizer.from_pretrained("tokenizer.json")
assert len(bpe) == 10
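
Under the hood, train learns an ordered list of byte-pair merge rules from the training text. For intuition, here is a minimal Python sketch of the classic BPE training loop; it is illustrative only, not toktkn's actual implementation, and it assumes vocab_size counts the merges learned on top of the 256 base byte values:

from collections import Counter

def train_bpe(text: str, vocab_size: int) -> list[tuple[int, int]]:
    # start from raw UTF-8 bytes; token ids 0-255 are the base vocabulary
    ids = list(text.encode("utf-8"))
    merges = []
    next_id = 256

    for _ in range(vocab_size):
        # count adjacent token pairs and pick the most frequent one
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:
            break
        best = max(pairs, key=pairs.get)

        # replace every occurrence of that pair with a new token id
        merged, i = [], 0
        while i < len(ids):
            if i + 1 < len(ids) and (ids[i], ids[i + 1]) == best:
                merged.append(next_id)
                i += 2
            else:
                merged.append(ids[i])
                i += 1
        ids = merged
        merges.append(best)
        next_id += 1

    # encoding replays these merge rules in order; decoding inverts them
    return merges

rules = train_bpe("some really interesting training data here...", vocab_size=10)
print(rules)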

Install

Install toktkn from PyPI with:

pip install toktkn

Note: if you want to build from source, make sure cargo is installed!

Performance

Slightly faster than OpenAI's tiktoken and a lot quicker than 🤗 tokenizers!

[benchmark plot: encoding speed of toktkn vs tiktoken vs 🤗 tokenizers]

Performance measured on 2.5 MB from the wikitext test split, using OpenAI's GPT-2 tokenizer via tiktoken==0.6.0 and the implementation from 🤗 tokenizers at tokenizers==0.19.1.
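
For reference, a rough sketch of how such a comparison could be run. The wikitext config, the use of 🤗 datasets for loading, the timing harness, and the training vocab size below are assumptions, not the benchmark script behind the plot:

import time
import tiktoken
from datasets import load_dataset  # assumption: wikitext loaded via 🤗 datasets
from tokenizers import Tokenizer
from toktkn import BPETokenizer, TokenizerConfig

# grab raw text from a wikitext test split (exact config and size are assumptions)
text = "\n".join(load_dataset("wikitext", "wikitext-103-raw-v1", split="test")["text"])

# train a small toktkn tokenizer; the vocab size here is arbitrary
bpe = BPETokenizer(TokenizerConfig(vocab_size=1000))
bpe.train(text[:100_000])

gpt2_tiktoken = tiktoken.get_encoding("gpt2")
gpt2_hf = Tokenizer.from_pretrained("gpt2")

def timed(name, encode):
    # time a single pass over the full benchmark text
    start = time.perf_counter()
    encode(text)
    print(f"{name}: {time.perf_counter() - start:.2f}s")

timed("toktkn", bpe.encode)
timed("tiktoken", gpt2_tiktoken.encode)
timed("🤗 tokenizers", gpt2_hf.encode)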

Commit count: 43
