llm-tokenizer

Overview

The llm-tokenizer crate exposes a single Tokenizer facade around multiple backends (Hugging Face JSON tokenizers, OpenAI/tiktoken models, and an in-memory mock). It packages the shared behaviours needed by LLM applications—encoding user text, incrementally decoding streamed tokens, tracking per-request state, and detecting stop conditions—behind trait objects so consuming code can remain backend-agnostic.

Key capabilities:

  • trait-based split between Encoder, Decoder, and Tokenizer for shared APIs across backends
  • Hugging Face tokenizer loading (with optional chat templates) and HF Hub downloads
  • heuristic selection of OpenAI/tiktoken encodings for GPT model names
  • incremental decoding utilities (DecodeStream, Sequence) that handle UTF-8 boundaries
  • stop sequence handling via StopSequenceDecoder with token-level and string-level triggers
  • optional Jinja2 chat-template rendering that matches Hugging Face semantics

The implementation deliberately keeps the surface area small—metrics, batching, or SentencePiece support mentioned in earlier drafts do not exist today. This document reflects the actual code as of tokenizer/src/*.

Source Map

  • lib.rs – module exports and the Tokenizer wrapper around Arc<dyn Tokenizer>
  • traits.rs – shared traits and the Encoding/SpecialTokens helper types
  • factory.rs – backend discovery, file/model heuristics, and tokio-aware creation helpers
  • hub.rs – Hugging Face Hub downloads via hf_hub
  • huggingface.rs – wrapper over tokenizers::Tokenizer, chat template loading, vocab access
  • tiktoken.rs – wrapper over tiktoken-rs encoders for OpenAI model families
  • chat_template.rs – AST-driven Jinja template inspection and rendering utilities
  • sequence.rs – stateful incremental decoding helper used by router sequences
  • stream.rs – stateless streaming decoder that yields textual chunks from token streams
  • stop.rs – stop-sequence detection with "jail" buffering and a builder API
  • mock.rs – lightweight tokenizer used by unit tests
  • tests.rs – smoke tests covering the trait facade and helpers (largely with the mock backend)
  • cache/ – multi-level caching infrastructure (L0 in-memory, L1 prefix-based)

Core Traits and Types (traits.rs)

  • Encoder, Decoder, and Tokenizer traits stay Send + Sync so instances can be shared across threads. Concrete backends implement the minimal methods: encode, encode_batch, decode, vocab_size, special-token lookup, and optional token↔id conversions.
  • Encoding wraps backend-specific results: Hf holds the Hugging Face encoding object, Sp is a plain ID vector reserved for future SentencePiece support, and Tiktoken stores u32 IDs from tiktoken-rs. Encoding::token_ids() is the zero-copy accessor used everywhere.
  • SpecialTokens collects optional BOS/EOS/etc. markers so upstream code can make backend-agnostic decisions.
  • Tokenizer (in lib.rs) is a thin Arc<dyn Tokenizer> newtype that exposes convenience methods (encode, decode, decode_stream, etc.) while keeping cloning cheap.
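
Consumers can stay backend-agnostic by programming against the facade alone. A minimal sketch, assuming the encode signature shown in the usage examples below (text plus an add-special-tokens flag) and an anyhow error type; count_tokens is an illustrative helper, not part of the crate:

use llm_tokenizer::Tokenizer;

fn count_tokens(tokenizer: &Tokenizer, text: &str) -> anyhow::Result<usize> {
    // Identical call regardless of whether the inner Arc<dyn Tokenizer>
    // wraps the Hugging Face, tiktoken, or mock backend
    let encoding = tokenizer.encode(text, false)?;
    Ok(encoding.token_ids().len())
}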

Backend Implementations

HuggingFaceTokenizer (huggingface.rs)

  • Loads tokenizer.json (or similar) using tokenizers::Tokenizer::from_file.
  • Caches vocab forward and reverse maps for token_to_id/id_to_token support (demonstrated in the sketch after this list).
  • Extracts special tokens using common patterns (e.g. <s>, [CLS]).
  • Supports optional chat templates: either auto-discovered next to the tokenizer via tokenizer_config.json or overridable with an explicit template path.
  • Exposes apply_chat_template which renders a minijinja template given JSON message payloads and template parameters.
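
A short sketch of vocabulary access; from_file is assumed to exist alongside from_file_with_chat_template, and the lookup return types are illustrative:

use llm_tokenizer::HuggingFaceTokenizer;

let hf = HuggingFaceTokenizer::from_file("./tokenizer.json")?;
// token_to_id hits the cached forward map, id_to_token the reverse map
if let Some(id) = hf.token_to_id("<s>") {
    assert_eq!(hf.id_to_token(id).as_deref(), Some("<s>"));
}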

TiktokenTokenizer (tiktoken.rs)

  • Wraps the tiktoken-rs CoreBPE builders (cl100k_base, p50k_base, p50k_edit, r50k_base).
  • from_model_name heuristically maps OpenAI model IDs (e.g. gpt-4, text-davinci-003) to those bases; unknown model names return an error rather than silently defaulting (see the sketch after this list).
  • Implements encode/decode operations; batch encode simply iterates sequentially.
  • Provides approximate vocab sizes and common GPT special tokens. Direct token↔id lookup is not implemented—the underlying library does not expose that mapping.
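
For example (signatures follow the trait facade; the round-trip assumes no special tokens are involved):

use llm_tokenizer::TiktokenTokenizer;

// gpt-4 resolves to the cl100k_base encoding; an unrecognised name is an Err
let tk = TiktokenTokenizer::from_model_name("gpt-4")?;
let encoding = tk.encode("fn main() {}", false)?;
assert_eq!(tk.decode(encoding.token_ids(), true)?, "fn main() {}");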

MockTokenizer (mock.rs)

  • Purely for tests; hard-codes a tiny vocabulary and simple whitespace tokenization.
  • Implements the same trait surface so helpers can be exercised without pulling real tokenizer data.

Factory and Backend Discovery (factory.rs)

  • create_tokenizer{,_async} accept either a filesystem path or a model identifier. Logic:
    1. Paths are loaded directly; the file extension (or JSON autodetection) selects the backend.
    2. Strings that look like OpenAI model names (gpt-*, davinci, curie, babbage, ada) use TiktokenTokenizer.
    3. Everything else attempts a Hugging Face Hub download via download_tokenizer_from_hf.
  • Chat templates can be injected with create_tokenizer_with_chat_template.
  • Async creation uses tokio for network access; the blocking variant reuses or spins up a runtime when called from synchronous contexts (see the sketch after this list).
  • SentencePiece (.model) and GGUF files are detected but currently return a clear not supported error.
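
A sketch of the async path; the model ID is a placeholder and the return type of create_tokenizer_async is assumed to match the blocking variant:

use llm_tokenizer::create_tokenizer_async;

// Neither a path nor an OpenAI-style name, so this falls through to an
// HF Hub download, then loads the cached files from disk
let tokenizer = create_tokenizer_async("org/model-name").await?;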

Hugging Face Hub Integration (hub.rs)

  • Uses the async hf_hub API to list and download tokenizer-related files (tokenizer.json, merges.txt, .model, etc.), filtering out weights and docs.
  • The helper returns the HF cache directory containing the fetched files; the factory then loads from disk using standard file paths.
  • Honours the HF_TOKEN environment variable for private or rate-limited models; without it, downloads may fail with an authorization error.
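
A hedged sketch of the two-step flow; the exact signature of download_tokenizer_from_hf (here assumed to return the cache directory as a PathBuf) may differ:

use llm_tokenizer::download_tokenizer_from_hf;

// Reads HF_TOKEN from the environment when present
let cache_dir = download_tokenizer_from_hf("org/model-name").await?;
let tokenizer = llm_tokenizer::create_tokenizer(
    cache_dir.join("tokenizer.json").to_str().unwrap(),
)?;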

Chat Template Support (chat_template.rs)

  • Detects whether a template expects raw string content or the structured OpenAI-style content list by walking the minijinja AST. This matches the Python-side detection logic used elsewhere in SGLang.
  • ChatTemplateProcessor (constructed per call) renders templates against JSON messages and ChatTemplateParams (system prompt, tools, EOS token handling, etc.). Errors surface as anyhow::Error, keeping parity with Hugging Face error messages.
  • The tokenizer wrapper stores both the template string and its detected content format so callers can pre-transform message content correctly.
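
The practical consequence for callers is the two message shapes below (standard OpenAI-style JSON, shown for illustration); the detected content format tells you which one a given template expects:

// String content, for templates that treat message["content"] as text
let simple = serde_json::json!({"role": "user", "content": "Hi"});
// Structured content list, for templates that iterate over content parts
let structured = serde_json::json!({
    "role": "user",
    "content": [{"type": "text", "text": "Hi"}]
});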

Streaming and Stateful Helpers

DecodeStream (stream.rs)

  • Maintains a sliding window (prefix_offset, read_offset) over accumulated token IDs.
  • Each step decodes the known prefix and the new slice; when the new slice produces additional UTF-8 text (and does not end in the Unicode replacement character U+FFFD), it returns the incremental chunk and updates offsets. Otherwise it returns None and waits for more tokens.
  • step_batch and flush offer convenience for batching and draining remaining text.
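
A minimal streaming loop, assuming the decode_stream constructor shown in the usage examples below and a flush that returns any remaining buffered text:

let mut stream = tokenizer.decode_stream(&[], true);
for &id in encoding.token_ids() {
    // step returns Some(text) only once the new tokens complete valid UTF-8
    if let Some(chunk) = stream.step(id)? {
        print!("{chunk}");
    }
}
// Drain text still held behind an incomplete UTF-8 boundary
print!("{}", stream.flush()?);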

Sequence (sequence.rs)

  • Holds per-request decoding state: accumulated IDs plus offsets mirroring DecodeStream.
  • append_text encodes extra prompt text; append_token decodes incremental output while respecting UTF-8 boundaries and handling stray replacement characters (U+FFFD); both are sketched after this list.
  • Designed for integration with router sequence management where decoded text must be replayed.
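
A sketch under assumed signatures (Sequence::new taking the tokenizer facade, append_token returning the newly decoded text):

use llm_tokenizer::Sequence;

let mut seq = Sequence::new(tokenizer.clone());
seq.append_text("Hello, ")?;       // extends the prompt-side token state
let chunk = seq.append_token(42)?; // decodes one generated token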

StopSequenceDecoder (stop.rs)

  • Extends the incremental decoding approach with a "jail" buffer that holds potential partial matches against configured stop sequences.
  • Supports both token-level stops (visible or hidden) and arbitrary string sequences. When a string stop is configured, the decoder emits only the safe prefix and keeps a suffix jailed until it can decide whether it completes a stop sequence.
  • Provides StopSequenceDecoderBuilder for ergonomic configuration and exposes process_token, process_tokens, flush, reset, and is_stopped helpers.

Caching (cache/)

The caching subsystem provides multi-level caching for tokenizer results:

  • L0Cache: In-memory LRU cache for exact-match token ID lookups
  • L1Cache: Prefix-based cache that can reuse partial encoding results
  • CachedTokenizer: Wrapper that adds caching to any tokenizer implementation (see the sketch after this list)
  • TokenizerFingerprint: Content-based fingerprinting for cache key generation
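
A sketch of wrapping a backend; the import path and CachedTokenizer::new constructor are assumed:

use llm_tokenizer::CachedTokenizer;

let cached = CachedTokenizer::new(tokenizer.clone());
let first = cached.encode("Hello, world!", false)?;  // miss: fills L0 (and L1 prefixes)
let second = cached.encode("Hello, world!", false)?; // exact-match L0 hit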

Testing

  • Unit tests cover the mock tokenizer, the Tokenizer wrapper, incremental decoding helpers, and stop-sequence behaviour (tests.rs, sequence.rs, stop.rs, tiktoken.rs, factory.rs, hub.rs). Network-dependent Hugging Face downloads are exercised behind a best-effort async test that skips in CI without credentials.
  • Use cargo test -p tokenizer to run the crate's test suite.

Known Limitations & Future Work

  • SentencePiece (.model) and GGUF tokenizers are detected but deliberately unimplemented.
  • Encoding::Sp exists for future SentencePiece support but currently behaves as a simple Vec<u32>.
  • TiktokenTokenizer cannot map individual tokens/IDs; the underlying library would need to expose its vocabulary to implement token_to_id/id_to_token.
  • There is no metrics or batching layer inside this module; the router records metrics elsewhere.
  • Dynamic batching / sequence pooling code that earlier READMEs mentioned never landed in Rust.

Usage Examples

use std::sync::Arc;
use llm_tokenizer::{
    create_tokenizer, SequenceDecoderOutput, StopSequenceDecoderBuilder, Tokenizer,
};

// Load a tokenizer from disk (Hugging Face JSON)
let tokenizer = Tokenizer::from_file("/path/to/tokenizer.json")?;
let encoding = tokenizer.encode("Hello, world!", false)?;
assert!(!encoding.token_ids().is_empty());

// Auto-detect OpenAI GPT tokenizer
let openai = create_tokenizer("gpt-4")?;
let text = openai.decode(&[1, 2, 3], true)?;

// Incremental decoding with stop sequences
let mut stream = tokenizer.decode_stream(&[], true);
let mut stop = StopSequenceDecoderBuilder::new(Arc::clone(&tokenizer))
    .stop_sequence("\nHuman:")
    .build();
for &token in encoding.token_ids() {
    // DecodeStream yields raw text chunks; the stop decoder independently
    // decides which portion is safe to surface
    if let Some(chunk) = stream.step(token)? {
        eprintln!("raw chunk: {chunk}");
    }
    match stop.process_token(token)? {
        SequenceDecoderOutput::Text(t) => println!("{}", t),
        SequenceDecoderOutput::StoppedWithText(t) => {
            println!("{}", t);
            break;
        }
        SequenceDecoderOutput::Stopped => break,
        SequenceDecoderOutput::Held => {}
    }
}

// Apply a chat template when one is bundled with the tokenizer
use llm_tokenizer::{chat_template::ChatTemplateParams, HuggingFaceTokenizer};

let mut hf = HuggingFaceTokenizer::from_file_with_chat_template(
    "./tokenizer.json",
    Some("./chat_template.jinja"),
)?;
let messages = vec![
    serde_json::json!({"role": "system", "content": "You are concise."}),
    serde_json::json!({"role": "user", "content": "Summarise Rust traits."}),
];
let prompt = hf.apply_chat_template(
    &messages,
    ChatTemplateParams {
        add_generation_prompt: true,
        continue_final_message: false,
        tools: None,
        documents: None,
        template_kwargs: None,
    },
)?;

Set HF_TOKEN in the environment if you need to download private models from the Hugging Face Hub.
