shimmytok

Crates.io: shimmytok
lib.rs: shimmytok
Version: 0.7.0
Created: 2025-10-22
Updated: 2026-01-13
Description: Pure Rust tokenizer for GGUF models with llama.cpp compatibility (SentencePiece + BPE + WPM + UGM + RWKV)
Repository: https://github.com/Michael-A-Kuykendall/shimmytok
Size: 633,253 bytes
Owner: Mike Kuykendall (Michael-A-Kuykendall)

Documentation: https://docs.rs/shimmytok

README

shimmytok

Pure Rust tokenizer for GGUF models

100% llama.cpp compatible β€’ zero C++ β€’ just works



shimmytok is free forever. MIT licensed, no strings attached.

πŸ’ If shimmytok helps you, consider sponsoring.


Features

  • πŸ¦€ Pure Rust - No C++ dependencies
  • πŸ“¦ Load from GGUF - Read tokenizers directly from model files
  • βœ… Validated - 10/10 llama.cpp vocab models passing
  • 🎯 Complete - All llama.cpp tokenizer types: SPM, BPE, WPM, UGM, RWKV, PLaMo-2

Installation

[dependencies]
shimmytok = "0.7"

Usage

use shimmytok::Tokenizer;

// Load tokenizer from GGUF file
let tokenizer = Tokenizer::from_gguf_file("model.gguf")?;

// Encode text to token IDs
let tokens = tokenizer.encode("Hello world", true)?;

// Decode token IDs back to text
let text = tokenizer.decode(&tokens, true)?;

Validated Models

All models validated against llama-tokenize with exact token match:

Model           Type  Status
bert-bge        WPM   βœ…
command-r       BPE   βœ…
deepseek-coder  BPE   βœ…
deepseek-llm    BPE   βœ…
falcon          BPE   βœ…
gpt-2           BPE   βœ…
llama-spm       SPM   βœ…
qwen2           BPE   βœ…
refact          BPE   βœ…
starcoder       BPE   βœ…

Tokenizer Coverage

Type     Algorithm                                         Status
SPM      SentencePiece resegment                           βœ…
BPE      Priority-queue merge + 41 pre-tokenizer patterns  βœ…
WPM      WordPiece greedy longest match                    βœ…
UGM      Unigram Viterbi DP                                βœ…
RWKV     Trie-based greedy                                 βœ…
PLaMo-2  Table-driven reverse DP                           βœ…
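The table above lists a priority-queue merge for BPE. As a rough illustration of that merge loop, here is a minimal self-contained sketch (it scans adjacent pairs linearly instead of using a heap, and `bpe_merge` and the `ranks` map are hypothetical names for this example, not shimmytok's API):

```rust
use std::collections::HashMap;

/// Simplified BPE merge loop: start from single characters, then
/// repeatedly merge the adjacent pair with the best (lowest) rank
/// until no mergeable pair remains. Real implementations add a
/// priority queue and pre-tokenizer splitting on top of this idea.
fn bpe_merge(word: &str, ranks: &HashMap<(String, String), usize>) -> Vec<String> {
    let mut parts: Vec<String> = word.chars().map(|c| c.to_string()).collect();
    loop {
        // Find the adjacent pair with the lowest rank (= highest priority).
        let best = parts
            .windows(2)
            .enumerate()
            .filter_map(|(i, pair)| {
                ranks
                    .get(&(pair[0].clone(), pair[1].clone()))
                    .map(|&rank| (rank, i))
            })
            .min();
        match best {
            Some((_, i)) => {
                // Replace the pair at position i with its merged symbol.
                let merged = format!("{}{}", parts[i], parts[i + 1]);
                parts.splice(i..i + 2, [merged]);
            }
            None => break, // no applicable merges remain
        }
    }
    parts
}
```

With ranks `("h","e") = 0`, `("l","l") = 1`, `("he","ll") = 2`, the word "hello" merges to `["hell", "o"]`.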

API

// Core
Tokenizer::from_gguf_file(path) -> Result<Tokenizer>
tokenizer.encode(text, add_special_tokens) -> Result<Vec<TokenId>>
tokenizer.decode(&tokens, skip_special_tokens) -> Result<String>
tokenizer.decode_single(token_id) -> Result<String>

// Metadata
tokenizer.vocab_size() -> usize
tokenizer.bos_token() -> Option<TokenId>
tokenizer.eos_token() -> Option<TokenId>
tokenizer.model_type() -> &str
tokenizer.pre_type() -> &str

// Batch
tokenizer.encode_batch(texts, add_special) -> Result<Vec<Vec<TokenId>>>

Why shimmytok?

  • No C++: Works anywhere Rust works (WASM, embedded, etc.)
  • No separate files: Loads tokenizer directly from GGUF
  • Correctness first: Every tokenizer validated against llama.cpp
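The WordPiece (WPM) greedy longest-match strategy from the coverage table can be sketched in a few lines. This is a hypothetical illustration, not shimmytok's API; real implementations also handle normalization, byte fallback, and unknown-token substitution:

```rust
use std::collections::HashSet;

/// Greedy longest-match WordPiece sketch: at each position, take the
/// longest vocabulary piece; pieces after the first carry a "##"
/// continuation prefix. Returns None when no piece matches (real
/// tokenizers would emit an [UNK] token instead).
fn wpm_tokenize(word: &str, vocab: &HashSet<String>) -> Option<Vec<String>> {
    let chars: Vec<char> = word.chars().collect();
    let mut pieces = Vec::new();
    let mut start = 0;
    while start < chars.len() {
        let mut end = chars.len();
        let mut found = None;
        // Shrink the candidate span until it matches a vocab entry.
        while end > start {
            let mut piece: String = chars[start..end].iter().collect();
            if start > 0 {
                piece = format!("##{piece}"); // continuation marker
            }
            if vocab.contains(&piece) {
                found = Some((piece, end));
                break;
            }
            end -= 1;
        }
        match found {
            Some((piece, next)) => {
                pieces.push(piece);
                start = next;
            }
            None => return None,
        }
    }
    Some(pieces)
}
```

For example, with a vocabulary containing "un", "##aff", and "##able", the word "unaffable" splits into `["un", "##aff", "##able"]`.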

License

MIT License - forever.


Maintainer: Michael A. Kuykendall
