| Crates.io | wordchuck |
| lib.rs | wordchuck |
| version | 0.0.6 |
| created_at | 2026-01-13 02:22:27.867006+00 |
| updated_at | 2026-01-25 00:46:55.245225+00 |
| description | LLM Tokenizer Library |
| homepage | |
| repository | https://github.com/crutcher/brn-nanochat |
| max_upload_size | |
| id | 2039182 |
| size | 159,555 |
See: nanochat's rustbpe.
This repo aims to be a Rust-first BPE tokenizer library, focusing on performance and ease of use as a first-class Rust crate.
Python bindings already exist for nanochat/rustbpe.
I am incrementally porting features from nanochat/rustbpe to this crate, while cleaning up the Rust mechanics and writing full tests and docs.
Training, tokenization, and decoding are complete.
TODO:
- tiktoken vocab format.
- Results, not panics.
- Logging (via tracing).
See: nanochat, whose equivalent tokenizer takes ~80 CPU minutes.
use wordchuck::training::trainer::{BinaryPairVocabTrainer, BinaryPairVocabTrainerOptions};
use wordchuck::vocab::io::tiktoken_io::save_word_map_to_tiktoken_path;
use wordchuck::vocab::public::patterns::GPT3_CL100K_WORD_PATTERN;
use wordchuck::vocab::UnifiedTokenVocab;
use wordchuck::encoders::UnifiedVocabEncoder;
use wordchuck::decoders::DictionaryDecoder;
use wordchuck::rayon::{ParallelRayonEncoder, ParallelRayonDecoder};
use std::sync::Arc;
fn example<I, S>(
    vocab_size: usize,
    batches: I,
    tiktoken_save_path: Option<String>,
) where
    I: IntoIterator,
    I::Item: AsRef<[S]>,
    S: AsRef<str>,
{
    // We can pick any unsigned integer token type whose range covers `vocab_size`;
    // see [`wordchuck::types::TokenType`].
    type T = u32;
    type K = String;
    type C = u64;

    let options = BinaryPairVocabTrainerOptions::new(
        GPT3_CL100K_WORD_PATTERN,
        vocab_size,
    );
    let mut trainer: BinaryPairVocabTrainer<K, C> = options.init();
    for batch in batches {
        // The trainer has no parallelism.
        // The perceived benefits of parallelism in the trainer
        // are insignificant if the IO for the sample source is
        // fed by another thread.
        trainer.update_from_samples(batch.as_ref());
    }
    let vocab: Arc<UnifiedTokenVocab<T>> = trainer
        .train::<T>()
        .expect("training failed")
        .into();

    if let Some(path) = tiktoken_save_path {
        save_word_map_to_tiktoken_path(&vocab.word_vocab, &path)
            .expect("failed to save tiktoken vocab");
        println!("- tiktoken vocab: {path:?}");
    }
    // Wrap the single-threaded encoder and decoder in rayon-parallel adapters.
    let encoder = UnifiedVocabEncoder::<T>::new(vocab.clone());
    let encoder = ParallelRayonEncoder::new(encoder);
    let decoder = DictionaryDecoder::new(vocab.compiled_dictionary());
    let decoder = ParallelRayonDecoder::new(decoder);
}
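For context, here is one way the example above might be driven. This driver is a hypothetical sketch: the in-memory batches and the vocab size are placeholders for illustration and are not values taken from the crate.

// Hypothetical driver for the `example` function above.
// The sample text and vocab size are placeholders, not crate defaults.
fn main() {
    let batches: Vec<Vec<String>> = vec![
        vec![
            "hello world".to_string(),
            "wordchuck trains BPE vocabularies".to_string(),
        ],
        vec!["another batch of sample text".to_string()],
    ];
    // The turbofish pins S = String; I is inferred from `batches`.
    example::<_, String>(1024, batches, None);
}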
Each shard is a ~90 MB parquet file.
$ time cargo run --release -p tokenizer_trainer -- --dataset-dir /media/Data/nanochat/dataset --time-encode-decode
Compiling tokenizer_trainer v0.0.0 (/home/crutcher/git/brn-nanochat/crates/wordchuck/examples/tokenizer_trainer)
Finished `release` profile [optimized] target(s) in 1.34s
Running `target/release/tokenizer_trainer --dataset-dir /media/Data/nanochat/dataset --time-encode-decode`
Loading Shards: [0, 1, 2, 3, 4, 5, 6, 7]
...
Training Tokenizer on shards: [0, 1, 2, 3, 4, 5, 6, 7]
- shard: 0
- shard: 1
- shard: 2
- shard: 3
- shard: 4
- shard: 5
- shard: 6
- shard: 7
- train
- training_duration: 220.05s
- vocab_size: 65535
Samples Summary:
- count: 20480
- avg size: 4741
Timing Config:
- batch size: 512
Timing Encode:
- batch avg: 69.918721ms
- sample avg: 136.56µs
- avg bps: 34.72 MB/s
Observed Bytes/Token Stats:
- total bytes: 97103222
- total tokens: 24645141
- sample byte/token: 3.94
Timing Decode:
- batch avg: 2.373206ms
- sample avg: 4.635µs
real 3m45.018s
user 78m36.407s
sys 37m53.941s
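As a sanity check on the reported numbers: 97,103,222 bytes / 24,645,141 tokens ≈ 3.94 bytes per token, which matches the byte/token line; and a 512-sample batch at an average of 4,741 bytes per sample is about 2.43 MB, so a 69.9 ms batch encode corresponds to roughly 34.7 MB/s, matching the reported throughput.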