shimmytok

Crates.io: shimmytok
lib.rs: shimmytok
Version: 0.7.0
Created: 2025-10-22
Updated: 2026-01-13
Description: Pure Rust tokenizer for GGUF models with llama.cpp compatibility (SentencePiece + BPE + WPM + UGM + RWKV)
Repository: https://github.com/Michael-A-Kuykendall/shimmytok
Size: 633,253 bytes
Owner: Mike Kuykendall (Michael-A-Kuykendall)

Documentation: https://docs.rs/shimmytok

README

shimmytok

Pure Rust tokenizer for GGUF models

100% llama.cpp compatible β€’ zero C++ β€’ just works



shimmytok is free forever. MIT licensed, no strings attached.

πŸ’ If shimmytok helps you, consider sponsoring.


Features

  • πŸ¦€ Pure Rust - No C++ dependencies
  • πŸ“¦ Load from GGUF - Read tokenizers directly from model files
  • βœ… Validated - 10/10 llama.cpp vocab models passing
  • 🎯 Complete - All llama.cpp tokenizer types: SPM, BPE, WPM, UGM, RWKV, PLaMo-2

Installation

[dependencies]
shimmytok = "0.7"

Usage

use shimmytok::Tokenizer;

// Load tokenizer from GGUF file
let tokenizer = Tokenizer::from_gguf_file("model.gguf")?;

// Encode text to token IDs
let tokens = tokenizer.encode("Hello world", true)?;

// Decode token IDs back to text
let text = tokenizer.decode(&tokens, true)?;

Validated Models

All models validated against llama-tokenize with exact token match:

Model           Type  Status
bert-bge        WPM   βœ…
command-r       BPE   βœ…
deepseek-coder  BPE   βœ…
deepseek-llm    BPE   βœ…
falcon          BPE   βœ…
gpt-2           BPE   βœ…
llama-spm       SPM   βœ…
qwen2           BPE   βœ…
refact          BPE   βœ…
starcoder       BPE   βœ…

Tokenizer Coverage

Type     Algorithm                                         Status
SPM      SentencePiece resegment                           βœ…
BPE      Priority-queue merge + 41 pre-tokenizer patterns  βœ…
WPM      WordPiece greedy longest match                    βœ…
UGM      Unigram Viterbi DP                                βœ…
RWKV     Trie-based greedy                                 βœ…
PLaMo-2  Table-driven reverse DP                           βœ…
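The table above lists a priority-queue merge for BPE. As a rough illustration of that merge loop, here is a minimal self-contained sketch (it scans adjacent pairs linearly instead of using a heap, and `bpe_merge` and the `ranks` map are hypothetical names for this example, not shimmytok's API):

```rust
use std::collections::HashMap;

/// Simplified BPE merge loop: start from single characters, then
/// repeatedly merge the adjacent pair with the best (lowest) rank
/// until no mergeable pair remains. Real implementations add a
/// priority queue and pre-tokenizer splitting on top of this idea.
fn bpe_merge(word: &str, ranks: &HashMap<(String, String), usize>) -> Vec<String> {
    let mut parts: Vec<String> = word.chars().map(|c| c.to_string()).collect();
    loop {
        // Find the adjacent pair with the lowest rank (= highest priority).
        let best = parts
            .windows(2)
            .enumerate()
            .filter_map(|(i, pair)| {
                ranks
                    .get(&(pair[0].clone(), pair[1].clone()))
                    .map(|&rank| (rank, i))
            })
            .min();
        match best {
            Some((_, i)) => {
                // Replace the pair at position i with its merged symbol.
                let merged = format!("{}{}", parts[i], parts[i + 1]);
                parts.splice(i..i + 2, [merged]);
            }
            None => break, // no applicable merges remain
        }
    }
    parts
}
```

With ranks `("h","e") = 0`, `("l","l") = 1`, `("he","ll") = 2`, the word "hello" merges to `["hell", "o"]`.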

API

// Core
Tokenizer::from_gguf_file(path) -> Result<Tokenizer>
tokenizer.encode(text, add_special_tokens) -> Result<Vec<TokenId>>
tokenizer.decode(&tokens, skip_special_tokens) -> Result<String>
tokenizer.decode_single(token_id) -> Result<String>

// Metadata
tokenizer.vocab_size() -> usize
tokenizer.bos_token() -> Option<TokenId>
tokenizer.eos_token() -> Option<TokenId>
tokenizer.model_type() -> &str
tokenizer.pre_type() -> &str

// Batch
tokenizer.encode_batch(texts, add_special) -> Result<Vec<Vec<TokenId>>>

Why shimmytok?

  • No C++: Works anywhere Rust works (WASM, embedded, etc.)
  • No separate files: Loads tokenizer directly from GGUF
  • Correctness first: Every tokenizer validated against llama.cpp
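The WordPiece (WPM) greedy longest-match strategy from the coverage table can be sketched in a few lines. This is a hypothetical illustration, not shimmytok's API; real implementations also handle normalization, byte fallback, and unknown-token substitution:

```rust
use std::collections::HashSet;

/// Greedy longest-match WordPiece sketch: at each position, take the
/// longest vocabulary piece; pieces after the first carry a "##"
/// continuation prefix. Returns None when no piece matches (real
/// tokenizers would emit an [UNK] token instead).
fn wpm_tokenize(word: &str, vocab: &HashSet<String>) -> Option<Vec<String>> {
    let chars: Vec<char> = word.chars().collect();
    let mut pieces = Vec::new();
    let mut start = 0;
    while start < chars.len() {
        let mut end = chars.len();
        let mut found = None;
        // Shrink the candidate span until it matches a vocab entry.
        while end > start {
            let mut piece: String = chars[start..end].iter().collect();
            if start > 0 {
                piece = format!("##{piece}"); // continuation marker
            }
            if vocab.contains(&piece) {
                found = Some((piece, end));
                break;
            }
            end -= 1;
        }
        match found {
            Some((piece, next)) => {
                pieces.push(piece);
                start = next;
            }
            None => return None,
        }
    }
    Some(pieces)
}
```

For example, with a vocabulary containing "un", "##aff", and "##able", the word "unaffable" splits into `["un", "##aff", "##able"]`.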

License

MIT License - forever.


Maintainer: Michael A. Kuykendall
