wordchuck

Crates.io: wordchuck
lib.rs: wordchuck
version: 0.0.6
created_at: 2026-01-13 02:22:27.867006+00
updated_at: 2026-01-25 00:46:55.245225+00
description: LLM Tokenizer Library
repository: https://github.com/crutcher/brn-nanochat
id: 2039182
size: 159,555
owner: Crutcher Dunnavant (crutcher)

README

Rust-centric clone of nanochat/rustbpe

See: nanochat rustbpe

This repo aims to be a Rust-first BPE tokenizer library, focusing on performance and ease of use as a first-class Rust crate.
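
For background, BPE training repeatedly counts adjacent token pairs and merges the most frequent pair into a new token id; the recorded merges then drive encoding. A minimal, generic sketch of that loop (purely illustrative; not this crate's implementation or API):

use std::collections::HashMap;

/// Finds the most frequent adjacent token pair, if any pair occurs at least twice.
fn most_frequent_pair(tokens: &[u32]) -> Option<(u32, u32)> {
    let mut counts: HashMap<(u32, u32), usize> = HashMap::new();
    for w in tokens.windows(2) {
        *counts.entry((w[0], w[1])).or_insert(0) += 1;
    }
    counts
        .into_iter()
        .filter(|&(_, c)| c >= 2)
        .max_by_key(|&(_, c)| c)
        .map(|(pair, _)| pair)
}

/// Replaces every occurrence of `pair` with the newly allocated token id `new_id`.
fn merge_pair(tokens: &[u32], pair: (u32, u32), new_id: u32) -> Vec<u32> {
    let mut out = Vec::with_capacity(tokens.len());
    let mut i = 0;
    while i < tokens.len() {
        if i + 1 < tokens.len() && (tokens[i], tokens[i + 1]) == pair {
            out.push(new_id);
            i += 2;
        } else {
            out.push(tokens[i]);
            i += 1;
        }
    }
    out
}

fn main() {
    // Start from raw bytes; new token ids are allocated above the byte range.
    let mut tokens: Vec<u32> = "aaabdaaabac".bytes().map(u32::from).collect();
    let mut next_id = 256u32;
    while let Some(pair) = most_frequent_pair(&tokens) {
        tokens = merge_pair(&tokens, pair, next_id);
        next_id += 1;
    }
    println!("merged to {} tokens: {tokens:?}", tokens.len());
}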

Python bindings already exist for nanochat/rustbpe.

Status: WIP

I am incrementally porting features from nanochat/rustbpe to this crate, while cleaning up the Rust mechanics and writing full tests and docs.

Training, tokenization (encoding), and decoding are complete.

TODO:

  • Save/Load vocabularies.
    • Save/Load well-known / named remote vocabularies.
    • Save/Load to tiktoken vocab format (format sketched after this list).
  • Benchmarks.
  • Error handling (as Results, not panics).
  • Tuning
    • Instrument tiktoken (via tracing).
    • Compare / fix perf differences.
  • Python/C*/Java Bindings?
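
For reference on the tiktoken item above: a tiktoken vocab file is plain text with one entry per line, each line holding the base64-encoded token bytes, a space, and the token's rank. A generic writer sketch under that assumption (this is not the crate's save_word_map_to_tiktoken_path; the base64 crate's Engine API is assumed):

use base64::engine::general_purpose::STANDARD;
use base64::Engine;
use std::io::Write;

/// Writes `(token_bytes, rank)` pairs as "base64(token_bytes) rank", one per line.
fn write_tiktoken_vocab<W: Write>(
    mut out: W,
    ranked_tokens: &[(Vec<u8>, u32)],
) -> std::io::Result<()> {
    for (bytes, rank) in ranked_tokens {
        writeln!(out, "{} {}", STANDARD.encode(bytes), rank)?;
    }
    Ok(())
}

fn main() -> std::io::Result<()> {
    // Toy ranked vocabulary, just to show the line format.
    let toy = vec![(b"hello".to_vec(), 0), (b" world".to_vec(), 1)];
    write_tiktoken_vocab(std::io::stdout(), &toy)
}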

See:

training example

  • The iterator stream of samples may be quite large; a lazy streaming sketch follows the example below.
  • Training a nanochat-equivalent tokenizer takes ~80 CPU minutes.
use wordchuck::training::trainer::{BinaryPairVocabTrainer, BinaryPairVocabTrainerOptions};
use wordchuck::vocab::io::tiktoken_io::save_word_map_to_tiktoken_path;
use wordchuck::vocab::public::patterns::GPT3_CL100K_WORD_PATTERN;
use wordchuck::vocab::UnifiedTokenVocab;
use wordchuck::encoders::UnifiedVocabEncoder;
use wordchuck::decoders::DictionaryDecoder;
use wordchuck::rayon::{ParallelRayonEncoder, ParallelRayonDecoder};
use std::sync::Arc;

fn example<I, S>(
    vocab_size: usize,
    batches: I,
    tiktoken_save_path: Option<String>,
) where
    I: IntoIterator,
    I::Item: AsRef<[S]>,
    S: AsRef<str>,
{
    // Any unsigned integer type that can represent values up to `vocab_size` works here;
    // see [`wordchuck::types::TokenType`].
    type T = u32;
    type K = String;
    type C = u64;

    let options = BinaryPairVocabTrainerOptions::new(
        GPT3_CL100K_WORD_PATTERN,
        vocab_size,
    );

    let mut trainer: BinaryPairVocabTrainer<K, C> = options.init();

    for batch in batches {
        // The trainer itself is single-threaded; parallelising the trainer
        // buys little when IO for the sample source is fed from another thread.
        trainer.update_from_samples(batch.as_ref());
    }

    let vocab: Arc<UnifiedTokenVocab<T>> = trainer
        .train::<T>()
        .expect("training failed")
        .into();

    if let Some(path) = tiktoken_save_path {
        save_word_map_to_tiktoken_path(&vocab.word_vocab, &path)
            .expect("failed to save tiktoken vocab");
        println!("- tiktoken vocab: {path:?}");
    }

    let encoder: UnifiedVocabEncoder<T> = UnifiedVocabEncoder::<T>::new(vocab.clone());
    let encoder = ParallelRayonEncoder::new(encoder);

    let decoder = DictionaryDecoder::new(vocab.compiled_dictionary());
    let decoder = ParallelRayonDecoder::new(decoder);
}
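
A possible way to feed `batches` without holding the corpus in memory is to stream lines lazily from shard files and group them into fixed-size batches. This sketch assumes newline-delimited text shards (the trainer example below reads parquet shards instead) and is not part of the crate's API:

use std::fs::File;
use std::io::{BufRead, BufReader};
use std::path::PathBuf;

/// Lazily yields `Vec<String>` batches of `batch_size` lines drawn from a
/// sequence of newline-delimited text shards, one shard at a time.
fn stream_line_batches(
    shard_paths: Vec<PathBuf>,
    batch_size: usize,
) -> impl Iterator<Item = Vec<String>> {
    let mut lines = shard_paths.into_iter().flat_map(|path| {
        let reader = BufReader::new(File::open(&path).expect("failed to open shard"));
        reader.lines().map(|line| line.expect("failed to read line"))
    });
    std::iter::from_fn(move || {
        let batch: Vec<String> = lines.by_ref().take(batch_size).collect();
        if batch.is_empty() { None } else { Some(batch) }
    })
}

// Usage with the `example` function above (paths and sizes are illustrative):
//   let shards = vec![PathBuf::from("shard-0.txt"), PathBuf::from("shard-1.txt")];
//   example(65_536, stream_line_batches(shards, 512), None);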

Example Tokenizer Trainer

Each shard is a ~90MB parquet file.

  • 64-core / 128-thread Threadripper
$ time cargo run --release -p tokenizer_trainer -- --dataset-dir /media/Data/nanochat/dataset  --time-encode-decode 
   Compiling tokenizer_trainer v0.0.0 (/home/crutcher/git/brn-nanochat/crates/wordchuck/examples/tokenizer_trainer)
    Finished `release` profile [optimized] target(s) in 1.34s
     Running `target/release/tokenizer_trainer --dataset-dir /media/Data/nanochat/dataset --time-encode-decode`
Loading Shards: [0, 1, 2, 3, 4, 5, 6, 7]
...

Training Tokenizer on shards: [0, 1, 2, 3, 4, 5, 6, 7]
- shard: 0
- shard: 1
- shard: 2
- shard: 3
- shard: 4
- shard: 5
- shard: 6
- shard: 7
- train
- training_duration: 220.05s
- vocab_size: 65535

Samples Summary:
- count: 20480
- avg size: 4741

Timing Config:
- batch size: 512

Timing Encode:
- batch avg: 69.918721ms
- sample avg: 136.56µs
- avg bps: 34.72 MB/s

Observed Bytes/Token Stats:
- total bytes: 97103222
- total tokens: 24645141
- sample byte/token: 3.94

Timing Decode:
- batch avg: 2.373206ms
- sample avg: 4.635µs

real    3m45.018s
user    78m36.407s
sys     37m53.941s
