| | |
|---|---|
| Crates.io | shimmytok |
| lib.rs | shimmytok |
| version | 0.7.0 |
| created_at | 2025-10-22 19:29:06.248553+00 |
| updated_at | 2026-01-13 15:59:16.170116+00 |
| description | Pure Rust tokenizer for GGUF models with llama.cpp compatibility (SentencePiece + BPE + WPM + UGM + RWKV) |
| homepage | |
| repository | https://github.com/Michael-A-Kuykendall/shimmytok |
| max_upload_size | |
| id | 1896109 |
| size | 633,253 |
shimmytok is free forever. MIT licensed, no strings attached.
If shimmytok helps you, consider sponsoring.
```toml
[dependencies]
shimmytok = "0.7"
```
```rust
use shimmytok::Tokenizer;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Load tokenizer from GGUF file
    let tokenizer = Tokenizer::from_gguf_file("model.gguf")?;
    // Encode text to token IDs
    let tokens = tokenizer.encode("Hello world", true)?;
    // Decode token IDs back to text
    let text = tokenizer.decode(&tokens, true)?;
    println!("{text}");
    Ok(())
}
```
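For a quick sanity check that special-token handling matches the GGUF metadata, you can compare the first encoded ID against the model's BOS token. A minimal sketch using only calls from the API reference below; whether every model type prepends BOS when `add_special_tokens` is true is an assumption:

```rust
use shimmytok::Tokenizer;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let tokenizer = Tokenizer::from_gguf_file("model.gguf")?;
    let tokens = tokenizer.encode("Hello world", true)?;

    // Assumption: models that define a BOS token prepend it when
    // add_special_tokens = true, so the first ID should match it.
    if let Some(bos) = tokenizer.bos_token() {
        assert_eq!(tokens.first(), Some(&bos));
    }
    Ok(())
}
```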
All supported models are validated against llama.cpp's llama-tokenize with an exact token-for-token match:
| Model | Type | Status |
|---|---|---|
| bert-bge | WPM | ✅ |
| command-r | BPE | ✅ |
| deepseek-coder | BPE | ✅ |
| deepseek-llm | BPE | ✅ |
| falcon | BPE | ✅ |
| gpt-2 | BPE | ✅ |
| llama-spm | SPM | ✅ |
| qwen2 | BPE | ✅ |
| refact | BPE | ✅ |
| starcoder | BPE | ✅ |
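To reproduce this validation locally, one approach is to shell out to llama.cpp's llama-tokenize and compare its IDs against shimmytok's output. A rough sketch; the `--ids` flag and its bracketed output format are assumptions about the llama.cpp CLI, and the parsing is deliberately naive:

```rust
use shimmytok::Tokenizer;
use std::process::Command;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let (model, prompt) = ("model.gguf", "Hello world");

    // Reference IDs from llama.cpp (assumed flags: -m, -p, --ids).
    let out = Command::new("llama-tokenize")
        .args(["-m", model, "-p", prompt, "--ids"])
        .output()?;
    let stdout = String::from_utf8(out.stdout)?;
    // Assumed output shape: "[1, 15043, 1526]"
    let reference: Vec<u32> = stdout
        .trim()
        .trim_start_matches('[')
        .trim_end_matches(']')
        .split(',')
        .map(|s| s.trim().parse())
        .collect::<Result<_, _>>()?;

    let tokens = Tokenizer::from_gguf_file(model)?.encode(prompt, true)?;
    // Assumption: TokenId is a plain u32 alias, so the vectors compare directly.
    assert_eq!(tokens, reference, "token mismatch vs llama-tokenize");
    Ok(())
}
```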
Supported tokenizer algorithms:

| Type | Algorithm | Status |
|---|---|---|
| SPM | SentencePiece resegment | ✅ |
| BPE | Priority queue merge + 41 pre-tokenizer patterns | ✅ |
| WPM | WordPiece greedy longest match | ✅ |
| UGM | Unigram Viterbi DP | ✅ |
| RWKV | Trie-based greedy | ✅ |
| PLaMo-2 | Table-driven reverse DP | ✅ |
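As an illustration of the simplest of these, here is a standalone sketch of WordPiece-style greedy longest-match segmentation. This is not shimmytok's internal code; the toy vocabulary and the `##` continuation prefix are assumptions borrowed from the original WordPiece convention, and byte indexing assumes ASCII input for brevity:

```rust
use std::collections::HashSet;

/// Greedy longest-match WordPiece segmentation over a single word.
/// Returns None if some suffix cannot be matched (an UNK in real tokenizers).
fn wpm_segment(word: &str, vocab: &HashSet<String>) -> Option<Vec<String>> {
    let mut pieces = Vec::new();
    let mut start = 0;
    while start < word.len() {
        // Scan for the longest vocab entry matching at `start`.
        let mut end = word.len();
        let mut found = None;
        while end > start {
            // Non-initial pieces carry the "##" continuation prefix.
            let candidate = if start == 0 {
                word[start..end].to_string()
            } else {
                format!("##{}", &word[start..end])
            };
            if vocab.contains(&candidate) {
                found = Some((candidate, end));
                break;
            }
            end -= 1;
        }
        let (piece, next) = found?;
        pieces.push(piece);
        start = next;
    }
    Some(pieces)
}

fn main() {
    let vocab: HashSet<String> = ["un", "##aff", "##able", "##a"]
        .iter()
        .map(|s| s.to_string())
        .collect();
    // Prints Some(["un", "##aff", "##able"]): "##aff" wins over "##a"
    // because the scan starts from the longest candidate.
    println!("{:?}", wpm_segment("unaffable", &vocab));
}
```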
```rust
// Core
Tokenizer::from_gguf_file(path) -> Result<Tokenizer>
tokenizer.encode(text, add_special_tokens) -> Result<Vec<TokenId>>
tokenizer.decode(&tokens, special) -> Result<String>
tokenizer.decode_single(token_id) -> Result<String>

// Metadata
tokenizer.vocab_size() -> usize
tokenizer.bos_token() -> Option<TokenId>
tokenizer.eos_token() -> Option<TokenId>
tokenizer.model_type() -> &str
tokenizer.pre_type() -> &str

// Batch
tokenizer.encode_batch(texts, add_special) -> Result<Vec<Vec<TokenId>>>
```
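Putting the metadata and batch calls together, a sketch built only from the signatures above (the slice-of-&str argument to encode_batch is an assumption; printed values vary by model):

```rust
use shimmytok::Tokenizer;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let tokenizer = Tokenizer::from_gguf_file("model.gguf")?;

    println!("model type: {}", tokenizer.model_type());
    println!("pre-tokenizer: {}", tokenizer.pre_type());
    println!("vocab size: {}", tokenizer.vocab_size());
    println!("bos: {:?}, eos: {:?}", tokenizer.bos_token(), tokenizer.eos_token());

    // Encode several inputs in one call; each gets its own Vec<TokenId>.
    let texts = ["Hello world", "fn main() {}"];
    let batches = tokenizer.encode_batch(&texts, true)?;
    for (text, ids) in texts.iter().zip(&batches) {
        println!("{text:?} -> {} tokens", ids.len());
    }
    Ok(())
}
```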
MIT License - forever.
Maintainer: Michael A. Kuykendall