| Crates.io | splintr |
| lib.rs | splintr |
| version | 0.8.0 |
| created_at | 2025-11-26 01:28:12.396988+00 |
| updated_at | 2025-12-24 06:55:07.073993+00 |
| description | Fast Rust BPE tokenizer with Python bindings |
| homepage | https://github.com/ml-rust/splintr |
| repository | https://github.com/ml-rust/splintr |
| max_upload_size | |
| id | 1950767 |
| size | 15,642,499 |

A high-performance BPE tokenizer built in Rust with Python bindings, focused on speed, safety, and resource efficiency.
Tokenization is everywhere in modern AI. Whether you're building LLM applications, training models, or processing data pipelines, you're tokenizing text constantly. But existing tokenizers have a problem: they're slow.
When you need to tokenize batches of prompts, documents, or training data, you're stuck waiting. Python-based tokenizers can't fully leverage modern multi-core CPUs. You need something faster.
Splintr brings Rust performance to Python. Built from the ground up for speed and efficiency:

| Batch Size | Splintr | tiktoken | HuggingFace | TokenDagger |
|---|---|---|---|---|
| 1,000 texts | 111 MB/s | 9 MB/s | 28 MB/s | 9 MB/s |
| 500 texts | 107 MB/s | 10 MB/s | 27 MB/s | 8 MB/s |
| 100 texts | 69 MB/s | 7 MB/s | 20 MB/s | 6 MB/s |
10-12x faster than tiktoken. 4x faster than HuggingFace. Built in Rust, accessible from Python.
```bash
pip install splintr-rs
```
```python
from splintr import Tokenizer

# Load a pretrained vocabulary
tokenizer = Tokenizer.from_pretrained("cl100k_base")    # OpenAI GPT-4/3.5
# tokenizer = Tokenizer.from_pretrained("llama3")       # Meta Llama 3 family
# tokenizer = Tokenizer.from_pretrained("deepseek_v3")  # DeepSeek V3/R1
# tokenizer = Tokenizer.from_pretrained("mistral_v1")   # Mistral 7B v0.1/v0.2
# tokenizer = Tokenizer.from_pretrained("mistral_v2")   # Mistral 7B v0.3, Codestral
# tokenizer = Tokenizer.from_pretrained("mistral_v3")   # Mistral NeMo, Large 2

# Encode and decode
tokens = tokenizer.encode("Hello, world!")
text = tokenizer.decode(tokens)

# Batch encode (10-12x faster)
texts = ["Hello, world!", "How are you?", "Machine learning is fun!"]
batch_tokens = tokenizer.encode_batch(texts)
```
See the API Guide for complete documentation and examples.
```toml
[dependencies]
splintr = "*" # or pin to a specific version
```

```rust
use splintr::{Tokenizer, CL100K_BASE_PATTERN};

// Build a tokenizer from a BPE rank map and a special-token map
// (e.g. loaded from a vocabulary file), plus the split pattern.
let tokenizer = Tokenizer::new(encoder, special_tokens, CL100K_BASE_PATTERN)?;

let tokens = tokenizer.encode("Hello, world!");
let batch_tokens = tokenizer.encode_batch(&texts);
```
See the API Guide and docs.rs for complete Rust documentation.
Performance where it matters:
Built for production:
Cross-platform:
All benchmarks performed on Linux (6.16.8-arch3-1) with 24 CPU cores, comparing against tiktoken (reference Python implementation), Hugging Face tokenizers, and TokenDagger.
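As a rough sketch of how a throughput figure like the MB/s numbers above can be reproduced by hand (this is not the project's benchmark harness; the corpus file and batch size are placeholders):

```python
import time
from splintr import Tokenizer

tokenizer = Tokenizer.from_pretrained("cl100k_base")

# Hypothetical corpus: 1,000 copies of one sample document
texts = [open("sample.txt", encoding="utf-8").read()] * 1000

start = time.perf_counter()
tokenizer.encode_batch(texts)
elapsed = time.perf_counter() - start

total_mb = sum(len(t.encode("utf-8")) for t in texts) / 1_000_000
print(f"{total_mb / elapsed:.1f} MB/s")
```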
For single texts, splintr achieves 3-4x faster encoding across various text sizes:

Latency by content type:

Consistent low latency across Python code, JSON, English prose, and Chinese text makes splintr ideal for interactive applications and real-time processing.
The real magic happens with batches. Splintr parallelizes across texts to achieve 10-12x speedup:

Higher speedups on larger batches where parallelization overhead is amortized. Perfect for:
Splintr uses sequential encoding for single texts and parallel encoding across batches based on empirical benchmarking:

Key findings:
- encode() uses sequential processing for optimal single-text performance
- encode_batch() parallelizes across multiple texts for maximum throughput
- encode_rayon() is available for the rare cases where you have >1 MB single texts

This architecture ensures splintr is optimized for the most common tokenization patterns in LLM applications.
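A minimal sketch of how these entry points map onto typical workloads. Exposing encode_rayon on the Python Tokenizer is an assumption based on the description above; check the API Guide for the exact surface:

```python
from splintr import Tokenizer

tokenizer = Tokenizer.from_pretrained("cl100k_base")

# Single prompt: encode() stays sequential, so there is no threading overhead.
ids = tokenizer.encode("A short prompt.")

# Many documents: encode_batch() parallelizes across the texts.
docs = ["doc one", "doc two", "doc three"]
batch_ids = tokenizer.encode_batch(docs)

# Very large single text (>1 MB): encode_rayon() splits the work across cores.
with open("large_file.txt", encoding="utf-8") as f:
    large_ids = tokenizer.encode_rayon(f.read())
```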
```bash
# Clone and install
git clone https://github.com/ml-rust/splintr.git
cd splintr
pip install -e .

# Install tiktoken for comparison
pip install tiktoken

# Run the benchmark suite
cd benchmarks
python benchmark.py --model cl100k_base --output results/my_benchmark.json

# View results
cat results/my_benchmark.md
```
The benchmark suite tests single text encoding, batch encoding, streaming decoder performance, and special token handling across various content types.
Splintr uses a pure-Rust regex engine (regexr) by default, with optional PCRE2 support for compatibility.
Default Backend (regexr):
Optional PCRE2 Backend:
```python
from splintr import Tokenizer

# Default: regexr backend (pure Rust)
tokenizer = Tokenizer.from_pretrained("cl100k_base")

# Optional: switch to PCRE2 (requires --features pcre2)
tokenizer = Tokenizer.from_pretrained("cl100k_base").pcre2(True)
```
To enable PCRE2, build with the feature flag:
```bash
maturin develop --release --features pcre2
```
Benchmarking:
```bash
# Compare backends (requires PCRE2 feature)
python benchmarks/benchmark_regexr_comparison.py --model cl100k_base

# Visual comparison with charts
python benchmarks/benchmark_regexr_viz.py --model cl100k_base
```
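For a quick ad-hoc comparison without the benchmark scripts, a sketch like the following also works (the sample corpus is a placeholder, and the PCRE2 path assumes a build with the pcre2 feature):

```python
import time
from splintr import Tokenizer

texts = ["some sample text to tokenize"] * 10_000  # placeholder corpus

def seconds_to_encode(tokenizer):
    # Time one batch-encode pass over the corpus
    start = time.perf_counter()
    tokenizer.encode_batch(texts)
    return time.perf_counter() - start

regexr = Tokenizer.from_pretrained("cl100k_base")
pcre2 = Tokenizer.from_pretrained("cl100k_base").pcre2(True)

print("regexr:", seconds_to_encode(regexr), "s")
print("pcre2: ", seconds_to_encode(pcre2), "s")
```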
For real-time LLM applications where tokens arrive one at a time, Splintr provides streaming decoders that handle UTF-8 boundary alignment:
```python
# Regular streaming decoder (cl100k_base, o200k_base, llama3)
decoder = tokenizer.streaming_decoder()

# ByteLevel streaming decoder (deepseek_v3, GPT-2)
decoder = tokenizer.byte_level_streaming_decoder()

# Process tokens as they arrive
for token_id in token_stream:
    if text := decoder.add_token(token_id):
        print(text, end="", flush=True)
print(decoder.flush())
```
Why streaming decoders? BPE tokens don't align with UTF-8 character boundaries. A multi-byte character like "世" might be split across tokens. The streaming decoder buffers incomplete byte sequences and only outputs complete characters.
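As an illustration, here is a small sketch: the streaming decoder emits nothing while a character's bytes are still incomplete and catches up once the rest arrive, so concatenating its output reproduces the original text. Which characters actually straddle token boundaries depends on the vocabulary; the string below is just an example.

```python
from splintr import Tokenizer

tokenizer = Tokenizer.from_pretrained("cl100k_base")
tokens = tokenizer.encode("世界, hello!")

decoder = tokenizer.streaming_decoder()
pieces = []
for token_id in tokens:
    piece = decoder.add_token(token_id)
    pieces.append(piece or "")  # incomplete UTF-8 sequences buffer as empty output
pieces.append(decoder.flush())

assert "".join(pieces) == "世界, hello!"
```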
See the API Guide for detailed usage, examples, and best practices.
| Vocabulary | Used By | Vocabulary Size | Special Tokens | Import Constant |
|---|---|---|---|---|
| cl100k_base | GPT-4, GPT-3.5-turbo | ~100,000 | 5 + 54 agent | CL100K_BASE_PATTERN |
| o200k_base | GPT-4o | ~200,000 | 2 + 54 agent | O200K_BASE_PATTERN |
| llama3 | Llama 3, 3.1, 3.2, 3.3 (Meta) | ~128,000 | 11 + 54 agent | LLAMA3_PATTERN |
| deepseek_v3 | DeepSeek V3, DeepSeek R1 | ~128,000 | 17 + 54 agent | LLAMA3_PATTERN |
| mistral_v1 | Mistral 7B v0.1/v0.2, Mixtral 8x7B | ~32,000 | 3 + 54 agent | SENTENCEPIECE_PATTERN |
| mistral_v2 | Mistral 7B v0.3, Codestral, 8x22B | ~32,768 | 10 + 54 agent | SENTENCEPIECE_PATTERN |
| mistral_v3 | Mistral NeMo, Large 2, Pixtral | ~131,000 | 10 + 54 agent | MISTRAL_V3_PATTERN |
OpenAI standard tokens:
- cl100k_base: <|endoftext|>, <|fim_prefix|>, <|fim_middle|>, <|fim_suffix|>, <|endofprompt|>
- o200k_base: <|endoftext|>, <|endofprompt|>

Meta Llama 3 standard tokens:
- <|begin_of_text|>, <|end_of_text|>, <|start_header_id|>, <|end_header_id|>, <|eot_id|>, <|eom_id|> (3.1+), <|python_tag|> (3.1+), <|step_id|> (3.2-Vision), <|image|> (3.2-Vision)

DeepSeek V3 standard tokens:
- <|begin▁of▁sentence|>, <|end▁of▁sentence|>, <think>, </think>, <|User|>, <|Assistant|>, <|EOT|>, FIM tokens (<|fim▁hole|>, <|fim▁begin|>, <|fim▁end|>), tool calling tokens (<|tool▁calls▁begin|>, <|tool▁call▁begin|>, etc.)

Mistral standard tokens:
- mistral_v1: <unk>, <s>, </s> (SentencePiece native)
- mistral_v2: [INST], [/INST], [TOOL_CALLS], [AVAILABLE_TOOLS], [/AVAILABLE_TOOLS], [TOOL_RESULTS], [/TOOL_RESULTS]
- mistral_v3: <unk>, <s>, </s> + control tokens (Tekken/Tiktoken-based, NOT SentencePiece)

Splintr extends all vocabularies with 54 specialized tokens for building agent systems:
```python
from splintr import Tokenizer, CL100K_AGENT_TOKENS

tokenizer = Tokenizer.from_pretrained("cl100k_base")

text = "<|think|>Let me reason...<|/think|>The answer is 42."
tokens = tokenizer.encode_with_special(text)

print(CL100K_AGENT_TOKENS.THINK)     # 100282
print(CL100K_AGENT_TOKENS.FUNCTION)  # 100292
```
| Category | Example Tokens | Purpose |
|---|---|---|
| Conversation | system, user, assistant, im_start, im_end | ChatML format |
| Thinking | think | Chain-of-Thought reasoning |
| ReAct | plan, step, act, observe | Agent action loops |
| Tools | function, result, error | Function calling |
| RAG | context, quote, cite, source | Citations |
See docs/special_tokens.md for the complete list and API Guide for usage examples.
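For example, a conversation prompt can be framed with the conversation-category tokens and encoded so that each marker maps to a single ID. The exact spellings (<|im_start|>, <|im_end|>) below are assumptions; see docs/special_tokens.md for the canonical names.

```python
from splintr import Tokenizer

tokenizer = Tokenizer.from_pretrained("cl100k_base")

# Hypothetical ChatML-style prompt; token spellings are assumptions.
prompt = (
    "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
    "<|im_start|>user\nWhat is 2 + 2?<|im_end|>\n"
    "<|im_start|>assistant\n"
)

# encode_with_special() keeps each agent token as a single ID
# instead of splitting it into regular BPE pieces.
tokens = tokenizer.encode_with_special(prompt)
print(len(tokens))
```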
Splintr implements several optimizations that make tokenization faster:
LLM Applications:
Agent Systems:
Training Pipelines:
RAG Applications:
Data Processing:
Contributions are welcome! Here's how you can help:
- Run cargo test and cargo clippy before submitting

```bash
# Clone the repository
git clone https://github.com/ml-rust/splintr.git
cd splintr

# Install pre-commit hook (recommended)
cp hooks/pre-commit .git/hooks/pre-commit
chmod +x .git/hooks/pre-commit

# Build the Rust library
cargo build --release

# Build Python bindings
pip install maturin
maturin develop --release

# Run tests
cargo test                   # Rust tests
cargo clippy --all-targets   # Linting
cargo fmt --all --check      # Format check
```
The pre-commit hook automatically runs formatting, clippy, and tests before each commit.
Splintr builds upon concepts from:
The performance optimizations are informed by profiling real-world usage patterns in LLM applications.
If you use Splintr in your research, please cite:
```bibtex
@software{splintr,
  author = {Farhan Syah},
  title  = {Splintr: High-Performance BPE Tokenizer},
  year   = {2025},
  url    = {https://github.com/ml-rust/splintr}
}
```