| Field | Value |
|---|---|
| Crates.io | trustformers-tokenizers |
| lib.rs | trustformers-tokenizers |
| version | 0.1.0-alpha.1 |
| created_at | 2025-11-09 10:12:35.940927+00 |
| updated_at | 2025-11-09 10:12:35.940927+00 |
| description | Tokenizers for TrustformeRS |
| homepage | |
| repository | https://github.com/cool-japan/trustformers |
| max_upload_size | |
| id | 1923936 |
| size | 3,198,532 |
High-performance tokenization library for transformer models with support for multiple tokenization algorithms.
This crate provides production-ready tokenizer implementations including BPE (Byte-Pair Encoding), WordPiece, and SentencePiece tokenizers. It's designed to be fast, memory-efficient, and compatible with popular tokenizer formats.
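The three algorithms differ in how they build subword units; BPE, for instance, starts from individual characters and greedily merges the most frequent adjacent pair. A self-contained sketch of one merge step, independent of this crate's API:

```rust
use std::collections::HashMap;

/// One greedy BPE step: find the most frequent adjacent symbol pair
/// in `word` and merge every occurrence of it into a single symbol.
fn merge_most_frequent(word: &mut Vec<String>) {
    let mut counts: HashMap<(String, String), usize> = HashMap::new();
    for pair in word.windows(2) {
        *counts.entry((pair[0].clone(), pair[1].clone())).or_insert(0) += 1;
    }
    let Some(((a, b), _)) = counts.into_iter().max_by_key(|&(_, c)| c) else {
        return; // nothing to merge
    };
    let mut merged = Vec::with_capacity(word.len());
    let mut i = 0;
    while i < word.len() {
        if i + 1 < word.len() && word[i] == a && word[i + 1] == b {
            merged.push(format!("{a}{b}"));
            i += 2;
        } else {
            merged.push(word[i].clone());
            i += 1;
        }
    }
    *word = merged;
}
```

Repeating this step until the vocabulary budget is reached yields the merge table a BPE model applies at encode time.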
```rust
use trustformers_tokenizers::{
    tokenizer::Tokenizer,
    models::bpe::BPE,
    pre_tokenizers::whitespace::Whitespace,
    processors::template::TemplateProcessing,
};

// Create a tokenizer with a default BPE model
let mut tokenizer = Tokenizer::new(BPE::default());

// Add a pre-tokenizer that splits on whitespace
tokenizer.with_pre_tokenizer(Whitespace::default());

// Add a post-processor for BERT-style special tokens
tokenizer.with_post_processor(
    TemplateProcessing::builder()
        .single("[CLS] $A [SEP]")
        .pair("[CLS] $A [SEP] $B [SEP]")
        .build()
);

// Tokenize text
let encoding = tokenizer.encode("Hello, world!", true)?;
println!("Tokens: {:?}", encoding.get_tokens());
println!("IDs: {:?}", encoding.get_ids());
```
```rust
use trustformers_tokenizers::tokenizer::Tokenizer;

// Load from file
let tokenizer = Tokenizer::from_file("path/to/tokenizer.json")?;

// Load from Hugging Face format
let tokenizer = Tokenizer::from_pretrained("bert-base-uncased")?;

// Tokenize with offsets
let encoding = tokenizer.encode_with_offsets("Hello world!", true)?;
for (token, (start, end)) in encoding.get_tokens().iter()
    .zip(encoding.get_offsets())
{
    println!("{}: {}-{}", token, start, end);
}
```
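The offsets above map each token back to its span in the original text. The idea can be illustrated standalone with a whitespace splitter that records byte offsets (a sketch, not this crate's implementation):

```rust
/// Split on whitespace while tracking each token's byte offsets,
/// mirroring the (start, end) pairs an offset-aware encoder returns.
fn tokens_with_offsets(text: &str) -> Vec<(&str, (usize, usize))> {
    let mut out = Vec::new();
    let mut start = None;
    for (i, c) in text.char_indices() {
        if c.is_whitespace() {
            // Close the current token, if any, at the delimiter.
            if let Some(s) = start.take() {
                out.push((&text[s..i], (s, i)));
            }
        } else if start.is_none() {
            start = Some(i);
        }
    }
    // Flush a trailing token that runs to the end of the text.
    if let Some(s) = start {
        out.push((&text[s..], (s, text.len())));
    }
    out
}
```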
```rust
let texts = vec![
    "First sentence.",
    "Second sentence is longer.",
    "Third one.",
];

// Encode all texts in one call
let mut encodings = tokenizer.encode_batch(&texts, true)?;

// Pad all encodings to the same length
let padded = tokenizer.pad_batch(&mut encodings, None)?;
```
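Padding itself is straightforward: every sequence is extended to the batch's maximum length with a pad id, and a mask records which positions are real. A standalone sketch (the pad id and mask layout here are illustrative assumptions, not this crate's exact output):

```rust
/// Pad every id sequence in a batch to the longest length using `pad_id`,
/// returning the padded batch plus an attention mask (1 = real, 0 = pad).
fn pad_batch(batch: &[Vec<u32>], pad_id: u32) -> (Vec<Vec<u32>>, Vec<Vec<u8>>) {
    let max_len = batch.iter().map(Vec::len).max().unwrap_or(0);
    let mut padded = Vec::with_capacity(batch.len());
    let mut masks = Vec::with_capacity(batch.len());
    for ids in batch {
        let mut row = ids.clone();
        let mut mask = vec![1u8; ids.len()];
        row.resize(max_len, pad_id);  // extend ids with the pad token
        mask.resize(max_len, 0);      // mark padded positions as 0
        padded.push(row);
        masks.push(mask);
    }
    (padded, masks)
}
```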
```text
trustformers-tokenizers/
├── src/
│   ├── tokenizer/       # Main tokenizer interface
│   ├── models/          # Tokenization algorithms
│   │   ├── bpe/         # BPE implementation
│   │   ├── wordpiece/   # WordPiece implementation
│   │   └── unigram/     # SentencePiece unigram
│   ├── pre_tokenizers/  # Pre-processing steps
│   ├── normalizers/     # Text normalization
│   ├── processors/      # Post-processing
│   └── decoders/        # Token-to-text decoding
```
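The `decoders/` module maps token sequences back to text. For WordPiece-style vocabularies this mostly means gluing `##`-prefixed continuation pieces onto their predecessor; a standalone sketch of that rule:

```rust
/// Reassemble WordPiece tokens: continuation pieces prefixed with "##"
/// attach to the previous token; all other tokens are joined with spaces.
fn decode_wordpiece(tokens: &[&str]) -> String {
    let mut out = String::new();
    for tok in tokens {
        if let Some(rest) = tok.strip_prefix("##") {
            out.push_str(rest); // continuation: glue onto previous piece
        } else {
            if !out.is_empty() {
                out.push(' ');
            }
            out.push_str(tok);
        }
    }
    out
}
```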
| Tokenizer | Text Size | Time (ms) | Throughput (MB/s) |
|---|---|---|---|
| BPE | 1KB | 0.12 | 8.3 |
| BPE | 1MB | 45 | 22.2 |
| WordPiece | 1KB | 0.15 | 6.7 |
| WordPiece | 1MB | 52 | 19.2 |
| SentencePiece | 1KB | 0.18 | 5.6 |
| SentencePiece | 1MB | 61 | 16.4 |
*Benchmarks on Apple M1, single-threaded.*
```rust
use trustformers_tokenizers::{
    models::bpe::{BPE, BpeTrainer},
    tokenizer::Tokenizer,
};

// Start from an untrained BPE model
let mut tokenizer = Tokenizer::new(BPE::default());

// Configure the trainer
let mut trainer = BpeTrainer::builder()
    .vocab_size(30000)
    .min_frequency(2)
    .special_tokens(vec![
        "[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]",
    ])
    .build();

// Train from files
let files = vec!["data/corpus.txt"];
tokenizer.train(&files, trainer)?;

// Save the trained tokenizer
tokenizer.save("my_tokenizer.json", false)?;
```
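Conceptually, the trainer's first pass counts adjacent symbol pairs across the corpus and discards any below `min_frequency`; the survivors are the candidate merges for the first iteration. A standalone sketch of that counting step:

```rust
use std::collections::HashMap;

/// Count adjacent symbol pairs across a corpus of pre-split words and
/// keep only those meeting `min_frequency` -- the candidate merges a
/// BPE trainer would consider on its first iteration.
fn candidate_pairs(
    words: &[Vec<&str>],
    min_frequency: usize,
) -> HashMap<(String, String), usize> {
    let mut counts = HashMap::new();
    for word in words {
        for pair in word.windows(2) {
            *counts
                .entry((pair[0].to_string(), pair[1].to_string()))
                .or_insert(0) += 1;
        }
    }
    // Drop rare pairs below the frequency cutoff.
    counts.retain(|_, c| *c >= min_frequency);
    counts
}
```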
Compatible with the Hugging Face `tokenizers` library format, and can load SentencePiece `.model` files directly. See also the `trustformers-py` and `trustformers-wasm` crates.

```rust
use trustformers_tokenizers::pre_tokenizers::{
    PreTokenizer, PreTokenizedString, SplitDelimiterBehavior,
};

struct CustomPreTokenizer;

impl PreTokenizer for CustomPreTokenizer {
    fn pre_tokenize(&self, pretok: &mut PreTokenizedString) -> Result<()> {
        // Custom splitting logic: split on whitespace, dropping delimiters
        pretok.split(|c| c.is_whitespace(), SplitDelimiterBehavior::Remove)?;
        Ok(())
    }
}
```
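The delimiter behavior controls what happens to the matched characters. The two most common options, dropping delimiters versus keeping them as standalone tokens, can be sketched without the crate (the enum and its variant names here are illustrative, not this crate's exact API):

```rust
/// What to do with characters matched by the delimiter predicate.
#[derive(Clone, Copy)]
enum DelimiterBehavior {
    Remove,  // drop the delimiter entirely
    Isolate, // keep it as its own token
}

/// Split `text` on `is_delim`, applying the chosen delimiter behavior.
fn split_with(
    text: &str,
    is_delim: fn(char) -> bool,
    behavior: DelimiterBehavior,
) -> Vec<String> {
    let mut out = Vec::new();
    let mut cur = String::new();
    for c in text.chars() {
        if is_delim(c) {
            if !cur.is_empty() {
                out.push(std::mem::take(&mut cur));
            }
            if let DelimiterBehavior::Isolate = behavior {
                out.push(c.to_string());
            }
        } else {
            cur.push(c);
        }
    }
    if !cur.is_empty() {
        out.push(cur);
    }
    out
}
```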
MIT OR Apache-2.0