| Crates.io | llm-tokenizer |
| lib.rs | llm-tokenizer |
| version | 1.0.0 |
| created_at | 2026-01-06 17:47:04.537774+00 |
| updated_at | 2026-01-21 02:33:32.613324+00 |
| description | LLM tokenizer library with caching and chat template support |
| homepage | |
| repository | https://github.com/lightseekorg/smg |
| max_upload_size | |
| id | 2026392 |
| size | 349,712 |
The llm-tokenizer crate exposes a single Tokenizer facade around multiple backends
(Hugging Face JSON tokenizers, OpenAI/tiktoken models, and an in-memory mock). It packages the
shared behaviours needed by LLM applications—encoding user text, incrementally decoding streamed tokens,
tracking per-request state, and detecting stop conditions—behind trait objects so consuming code
can remain backend-agnostic.
Key capabilities:
- `Encoder`, `Decoder`, and `Tokenizer` traits for shared APIs across backends
- Incremental decoding helpers (`DecodeStream`, `Sequence`) that handle UTF-8 boundaries
- `StopSequenceDecoder` with token-level and string-level triggers

The implementation deliberately keeps the surface area small: metrics, batching, and SentencePiece
support mentioned in earlier drafts do not exist today. This document reflects the actual code
as of `tokenizer/src/*`.
## Module overview

- `lib.rs` – module exports and the `Tokenizer` wrapper around `Arc<dyn Tokenizer>`
- `traits.rs` – shared traits and the `Encoding`/`SpecialTokens` helper types
- `factory.rs` – backend discovery, file/model heuristics, and tokio-aware creation helpers
- `hub.rs` – Hugging Face Hub downloads via `hf_hub`
- `huggingface.rs` – wrapper over `tokenizers::Tokenizer`, chat template loading, vocab access
- `tiktoken.rs` – wrapper over `tiktoken-rs` encoders for OpenAI model families
- `chat_template.rs` – AST-driven Jinja template inspection and rendering utilities
- `sequence.rs` – stateful incremental decoding helper used by router sequences
- `stream.rs` – stateless streaming decoder that yields textual chunks from token streams
- `stop.rs` – stop-sequence detection with "jail" buffering and a builder API
- `mock.rs` – lightweight tokenizer used by unit tests
- `tests.rs` – smoke tests covering the trait facade and helpers (largely with the mock backend)
- `cache/` – multi-level caching infrastructure (L0 in-memory, L1 prefix-based)

## Core traits and types (`traits.rs`)

The `Encoder`, `Decoder`, and `Tokenizer` traits stay `Send + Sync` so instances can be shared across
threads. Concrete backends implement the minimal methods: `encode`, `encode_batch`, `decode`,
`vocab_size`, special-token lookup, and optional token↔id conversions.

`Encoding` wraps backend-specific results: `Hf` holds the Hugging Face encoding object,
`Sp` is a plain ID vector reserved for future SentencePiece support, and `Tiktoken` stores `u32` IDs
from `tiktoken-rs`. `Encoding::token_ids()` is the zero-copy accessor used everywhere.

`SpecialTokens` collects optional BOS/EOS/etc. markers so upstream code can make backend-agnostic
decisions.

`Tokenizer` (in `lib.rs`) is a thin `Arc<dyn Tokenizer>` newtype that exposes convenience methods
(`encode`, `decode`, `decode_stream`, etc.) while keeping cloning cheap.
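A simplified sketch of the shapes these types describe (illustrative only; the real definitions live in `traits.rs` and may differ in detail):

```rust
// Simplified sketch of the Encoding / SpecialTokens shapes described above.
// Anything beyond the Hf/Sp/Tiktoken variants and token_ids() is illustrative.
pub enum Encoding {
    Hf(Box<tokenizers::Encoding>), // full Hugging Face encoding object
    Sp(Vec<u32>),                  // plain IDs, reserved for SentencePiece
    Tiktoken(Vec<u32>),            // IDs produced by tiktoken-rs
}

impl Encoding {
    /// Zero-copy access to the token IDs, whichever backend produced them.
    pub fn token_ids(&self) -> &[u32] {
        match self {
            Encoding::Hf(e) => e.get_ids(),
            Encoding::Sp(ids) | Encoding::Tiktoken(ids) => ids,
        }
    }
}

/// Optional special-token markers (field names are illustrative).
#[derive(Default)]
pub struct SpecialTokens {
    pub bos_token: Option<String>,
    pub eos_token: Option<String>,
    pub pad_token: Option<String>,
}
```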
## Hugging Face backend (`huggingface.rs`)

- Loads `tokenizer.json` (or similar) using `tokenizers::Tokenizer::from_file`.
- Exposes the full vocabulary, with `token_to_id`/`id_to_token` support.
- Surfaces the special tokens declared by the tokenizer (e.g. `<s>`, `[CLS]`).
- The chat template is loaded from `tokenizer_config.json` or overridable with an explicit template path.
- Provides `apply_chat_template`, which renders a minijinja template given JSON message payloads and template parameters.
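A minimal sketch of using the vocabulary accessors directly; the plain `from_file` constructor and the exact return types are assumptions (only `from_file_with_chat_template`, `token_to_id`, and `id_to_token` are named in the code):

```rust
use llm_tokenizer::HuggingFaceTokenizer;

// Assumed constructor; `from_file_with_chat_template` (shown in the usage
// example below) is the documented sibling.
let hf = HuggingFaceTokenizer::from_file("./tokenizer.json")?;

// token_to_id / id_to_token round-trip a special token through the vocab.
if let Some(id) = hf.token_to_id("<s>") {
    assert_eq!(hf.id_to_token(id).as_deref(), Some("<s>"));
}
```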
## Tiktoken backend (`tiktoken.rs`)

- Wraps the `tiktoken-rs` `CoreBPE` builders (`cl100k_base`, `p50k_base`, `p50k_edit`, `r50k_base`).
- `from_model_name` heuristically maps OpenAI model IDs (e.g. `gpt-4`, `text-davinci-003`) to those bases. Unknown model names return an error rather than silently defaulting.
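For example (a sketch: `from_model_name` is the constructor named above; the error type is an assumption):

```rust
use llm_tokenizer::TiktokenTokenizer;

// Known OpenAI families resolve to the bundled encodings.
let gpt4 = TiktokenTokenizer::from_model_name("gpt-4")?;                // cl100k_base
let davinci = TiktokenTokenizer::from_model_name("text-davinci-003")?;  // p50k_base

// Unknown names fail loudly instead of silently falling back to a default base.
assert!(TiktokenTokenizer::from_model_name("not-a-real-model").is_err());
```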
## Mock backend (`mock.rs`)

A lightweight in-memory tokenizer used by the unit tests so the trait facade can be exercised without loading real vocabularies.

## Factory (`factory.rs`)

`create_tokenizer{,_async}` accept either a filesystem path or a model identifier. Logic:

- OpenAI-style model names (`gpt-*`, `davinci`, `curie`, `babbage`, `ada`) use `TiktokenTokenizer`.
- Other model identifiers are fetched from the Hugging Face Hub via `download_tokenizer_from_hf`.
- An explicit chat template can be supplied through `create_tokenizer_with_chat_template`.
- The async variant relies on `tokio` for network access. The blocking variant reuses or spins up a runtime when called from synchronous contexts.
- SentencePiece (`.model`) and GGUF files are detected but currently return a clear "not supported" error.
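A sketch of the two entry points; the exact async signature and the Hub repo id used here are assumptions:

```rust
use llm_tokenizer::{create_tokenizer, create_tokenizer_async};

// A local file path loads the Hugging Face JSON backend directly.
let local = create_tokenizer("./models/llama/tokenizer.json")?;

// OpenAI-style names short-circuit to the tiktoken backend.
let gpt = create_tokenizer("gpt-4")?;

// Anything else is treated as a Hub model id and downloaded. Inside an async
// context, prefer the async variant so no extra runtime is spun up.
let downloaded = create_tokenizer_async("meta-llama/Llama-3.1-8B-Instruct").await?;
```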
## Hub downloads (`hub.rs`)

- Uses the `hf_hub` API to list and download tokenizer-related files (`tokenizer.json`, `merges.txt`, `.model`, etc.), filtering out weights and docs.
- Honours the `HF_TOKEN` environment variable for private or rate-limited models. Without it the download may fail with an authorization error.
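A sketch of calling the download helper directly; the module path, async signature, and `PathBuf` return type are all assumptions (only the function name comes from the code):

```rust
use llm_tokenizer::hub::download_tokenizer_from_hf;

// Export HF_TOKEN in the environment first for gated or rate-limited repos.
// Assumed signature: async fn(&str) -> anyhow::Result<std::path::PathBuf>.
let path = download_tokenizer_from_hf("meta-llama/Llama-3.1-8B-Instruct").await?;
println!("tokenizer files downloaded to {}", path.display());
```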
## Chat templates (`chat_template.rs`)

- Detects whether a template expects plain string content or a structured content list by walking the minijinja AST. This matches the Python-side detection logic used elsewhere in SGLang.
- `ChatTemplateProcessor` (constructed per call) renders templates against JSON messages and `ChatTemplateParams` (system prompt, tools, EOS token handling, etc.). Errors surface as `anyhow::Error`, keeping parity with Hugging Face error messages.
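The two message shapes that distinction maps to look like this (illustrative JSON only; the detection itself inspects the template's AST, not the messages):

```rust
// "String content": the template renders message.content as a single string.
let string_content = serde_json::json!({
    "role": "user",
    "content": "Summarise Rust traits."
});

// Structured content list: the template iterates over typed content parts.
let list_content = serde_json::json!({
    "role": "user",
    "content": [
        { "type": "text", "text": "Summarise Rust traits." }
    ]
});
```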
## `DecodeStream` (`stream.rs`)

- Maintains a pair of offsets (`prefix_offset`, `read_offset`) over accumulated token IDs.
- `step` decodes the known prefix and the new slice; when the new slice produces additional UTF-8 text (and does not end in the replacement character �), it returns the incremental chunk and updates offsets. Otherwise it returns `None` and waits for more tokens.
- `step_batch` and `flush` offer convenience for batching and draining remaining text.
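A typical streaming loop built from those pieces (the boolean flag mirrors the usage example below; `flush` returning `Option<String>` is an assumption):

```rust
use llm_tokenizer::Tokenizer;

let tokenizer = Tokenizer::from_file("./tokenizer.json")?;
let encoding = tokenizer.encode("Hello, world!", false)?;

// Decode tokens as they arrive; `step` holds text back until it forms valid UTF-8.
let mut stream = tokenizer.decode_stream(&[], true);
for &token in encoding.token_ids() {
    if let Some(chunk) = stream.step(token)? {
        print!("{}", chunk);
    }
}
// Drain whatever is still buffered behind the offsets at the end of the stream.
if let Some(rest) = stream.flush()? {
    print!("{}", rest);
}
```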
## `Sequence` (`sequence.rs`)

- Stateful incremental decoding helper, used by router sequences, built on the same offset logic as `DecodeStream`.
- `append_text` encodes extra prompt text; `append_token` decodes incremental output while respecting UTF-8 boundaries and replacing stray � characters.
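A sketch of driving a `Sequence` for one request; the constructor and return types are assumptions (only `append_text` and `append_token` are named in the code):

```rust
use llm_tokenizer::{Sequence, Tokenizer};

let tokenizer = Tokenizer::from_file("./tokenizer.json")?;

// Assumed constructor taking the shared tokenizer handle.
let mut seq = Sequence::new(tokenizer.clone());

// Prompt text is encoded and appended to the sequence's token history.
seq.append_text("Hello, world!")?;

// Generated tokens are decoded incrementally; only complete UTF-8 text comes back.
for &token in &[101u32, 102, 103] {
    if let Some(text) = seq.append_token(token)? {
        print!("{}", text);
    }
}
```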
## `StopSequenceDecoder` (`stop.rs`)

- Detects stop conditions with token-level and string-level triggers, holding candidate text in a "jail" buffer until a match is confirmed or ruled out.
- Ships a `StopSequenceDecoderBuilder` for ergonomic configuration and exposes `process_token`, `process_tokens`, `flush`, `reset`, and `is_stopped` helpers.
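Beyond the per-token loop in the usage example below, the batch and lifecycle helpers look roughly like this (method names come from the code; signatures and return types beyond the builder are assumptions):

```rust
use std::sync::Arc;
use llm_tokenizer::{SequenceDecoderOutput, StopSequenceDecoderBuilder, Tokenizer};

let tokenizer = Tokenizer::from_file("./tokenizer.json")?;
let encoding = tokenizer.encode("Hello!\nHuman: hi", false)?;

let mut stop = StopSequenceDecoderBuilder::new(Arc::clone(&tokenizer))
    .stop_sequence("\nHuman:")
    .stop_sequence("</s>")
    .build();

// Feed a whole batch of generated tokens at once.
match stop.process_tokens(encoding.token_ids())? {
    SequenceDecoderOutput::Text(t) | SequenceDecoderOutput::StoppedWithText(t) => print!("{}", t),
    SequenceDecoderOutput::Held | SequenceDecoderOutput::Stopped => {}
}

// Release anything still held in the jail buffer, then reset for the next request.
let _tail = stop.flush();
if stop.is_stopped() {
    stop.reset();
}
```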
## Caching (`cache/`)

The caching subsystem provides multi-level caching for tokenizer results:

- `L0Cache`: in-memory LRU cache for exact-match token ID lookups
- `L1Cache`: prefix-based cache that can reuse partial encoding results
- `CachedTokenizer`: wrapper that adds caching to any tokenizer implementation
- `TokenizerFingerprint`: content-based fingerprinting for cache key generation
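A sketch of how the pieces are meant to compose; the module path and the `CachedTokenizer::new` constructor are hypothetical (only the type names come from `cache/`):

```rust
use llm_tokenizer::cache::CachedTokenizer;
use llm_tokenizer::Tokenizer;

let tokenizer = Tokenizer::from_file("./tokenizer.json")?;

// Hypothetical constructor; the real wiring of L0Cache / L1Cache may differ.
let cached = CachedTokenizer::new(tokenizer);

// Identical requests are served from the exact-match L0 cache; requests sharing
// a long prefix can reuse partial results from the L1 prefix cache.
let first = cached.encode("You are a helpful assistant.\nHello!", false)?;
let again = cached.encode("You are a helpful assistant.\nHello!", false)?;
assert_eq!(first.token_ids(), again.token_ids());
```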
## Testing

Unit tests cover the `Tokenizer` wrapper, incremental decoding helpers, and stop-sequence behaviour (`tests.rs`, `sequence.rs`, `stop.rs`, `tiktoken.rs`, `factory.rs`, `hub.rs`). Network-dependent Hugging Face downloads are exercised behind a best-effort async test that skips in CI without credentials.

Use `cargo test -p tokenizer` to run the crate's test suite.

## Limitations

- SentencePiece (`.model`) and GGUF tokenizers are detected but deliberately unimplemented.
- `Encoding::Sp` exists for future SentencePiece support but currently behaves as a simple `Vec<u32>`.
- `TiktokenTokenizer` cannot map individual tokens/IDs; the underlying library would need to expose its vocabulary to implement `token_to_id`/`id_to_token`.
## Usage

```rust
use std::sync::Arc;
use llm_tokenizer::{
    create_tokenizer, SequenceDecoderOutput, StopSequenceDecoderBuilder, Tokenizer,
};
// Load a tokenizer from disk (Hugging Face JSON)
let tokenizer = Tokenizer::from_file("/path/to/tokenizer.json")?;
let encoding = tokenizer.encode("Hello, world!", false)?;
assert!(!encoding.token_ids().is_empty());
// Auto-detect OpenAI GPT tokenizer
let openai = create_tokenizer("gpt-4")?;
let text = openai.decode(&[1, 2, 3], true)?;
// Incremental decoding with stop sequences
let mut stream = tokenizer.decode_stream(&[], true);
let mut stop = StopSequenceDecoderBuilder::new(Arc::clone(&tokenizer))
    .stop_sequence("\nHuman:")
    .build();
for &token in encoding.token_ids() {
    if let Some(_chunk) = stream.step(token)? {
        match stop.process_token(token)? {
            SequenceDecoderOutput::Text(t) => println!("{}", t),
            SequenceDecoderOutput::StoppedWithText(t) => {
                println!("{}", t);
                break;
            }
            SequenceDecoderOutput::Held | SequenceDecoderOutput::Stopped => {}
        }
    }
}
// Apply a chat template when one is bundled with the tokenizer
use llm_tokenizer::{chat_template::ChatTemplateParams, HuggingFaceTokenizer};
let mut hf = HuggingFaceTokenizer::from_file_with_chat_template(
    "./tokenizer.json",
    Some("./chat_template.jinja"),
)?;
let messages = vec![
    serde_json::json!({"role": "system", "content": "You are concise."}),
    serde_json::json!({"role": "user", "content": "Summarise Rust traits."}),
];
let prompt = hf.apply_chat_template(
    &messages,
    ChatTemplateParams {
        add_generation_prompt: true,
        continue_final_message: false,
        tools: None,
        documents: None,
        template_kwargs: None,
    },
)?;
```
Set HF_TOKEN in the environment if you need to download private models from the Hugging Face Hub.