unitoken

Crates.io: unitoken
lib.rs: unitoken
version: 0.1.1
created_at: 2025-12-17 22:10:20.119318+00
updated_at: 2025-12-18 06:46:21.605586+00
description: Fast BPE tokenizer/trainer with a Rust core and Python bindings
repository: https://github.com/a-gradient/unitoken
id: 1991212
size: 233,127
Clouds (clouds56)

unitoken

unitoken is a fast BPE tokenizer/trainer with a Rust core and optional Python bindings.

Install

Rust:

cargo add unitoken

Python (wheels via PyPI):

pip install uni-tokenizer

Quickstart (Python)

from uni_tokenizer import BpeTrainer, BpeEncoder

# Train a small vocabulary from word frequencies.
trainer = BpeTrainer(["<|endoftext|>"])  # first token is treated as EOT
trainer.add_words({"hello": 10, "world": 7})  # word -> count
trainer.train(vocab_size=256)
trainer.save("demo")

# Load the trained tokenizer and encode a single word.
enc = BpeEncoder.load("demo")
ids = enc.encode_word("hello")
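
How BPE training works (illustrative sketch)

For readers new to BPE, the following pure-Python sketch shows roughly what a trainer does with the word counts passed in the quickstart: it repeatedly merges the most frequent adjacent symbol pair. This is not unitoken's implementation. The names train_bpe, merge_pair, and encode_word below are hypothetical helpers written for this example, the sketch works on characters rather than bytes, and it returns symbol strings instead of integer IDs.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def train_bpe(word_counts, num_merges):
    """Toy BPE trainer (hypothetical helper, not the unitoken API)."""
    # Start with every word split into single-character symbols.
    words = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pairs = Counter()
        for symbols, count in words.items():
            for left, right in zip(symbols, symbols[1:]):
                pairs[(left, right)] += count
        if not pairs:
            break  # every word is already a single symbol
        # Most frequent pair wins; ties are broken lexicographically so runs are deterministic.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        words = {merge_pair(symbols, best): count for symbols, count in words.items()}
    return merges


def encode_word(word, merges):
    """Apply the learned merges, in training order, to a single word."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols


merges = train_bpe({"hello": 10, "world": 7}, num_merges=4)
print(merges)
print(encode_word("hello", merges))
```

A real tokenizer also assigns an integer ID to each symbol and reserves IDs for special tokens such as the EOT token; the sketch leaves that out to keep the merge loop visible.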

Building from source

This project uses maturin for the Python extension module.

maturin develop