| Crates.io | tekken-rs |
| lib.rs | tekken-rs |
| version | 0.1.1 |
| created_at | 2025-07-25 07:36:35.480198+00 |
| updated_at | 2025-07-28 07:59:37.915259+00 |
| description | Rust implementation of Mistral Tekken tokenizer with audio support |
| homepage | https://github.com/jorge-menjivar/tekken-rs |
| repository | https://github.com/jorge-menjivar/tekken-rs |
| max_upload_size | |
| id | 1767287 |
| size | 134,393 |
A Rust implementation of the Mistral Tekken tokenizer with audio support. This library provides fast and efficient tokenization capabilities for text and audio data, fully compatible with Mistral AI's tokenizer.
Add this to your Cargo.toml:
[dependencies]
tekken = "0.1.0"
Or use the Git repository directly:
[dependencies]
tekken = { git = "https://github.com/jorge-menjivar/tekken-rs" }
use tekken::tekkenizer::Tekkenizer;
use tekken::special_tokens::SpecialTokenPolicy;
fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Load tokenizer
    let tokenizer = Tekkenizer::from_file("tekken.json")?;

    // Encode text
    let text = "Hello, world!";
    let tokens = tokenizer.encode(text, true, true)?; // add_bos=true, add_eos=true

    // Decode tokens
    let decoded = tokenizer.decode(&tokens, SpecialTokenPolicy::Keep)?;

    println!("Original: {}", text);
    println!("Tokens: {:?}", tokens);
    println!("Decoded: {}", decoded);
    Ok(())
}
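The two boolean arguments to encode in the example above control whether begin-of-sequence and end-of-sequence markers are added. A minimal, standalone sketch of that idea follows. It does not use the crate, and it assumes each flag adds exactly one marker id at the matching end of the sequence. The ids 1 and 2 and the function name are hypothetical, chosen only for illustration; the real ids come from the vocabulary file.

```rust
/// Hypothetical marker ids for illustration only; a real tokenizer reads
/// these from its vocabulary file rather than hard-coding them.
const BOS_ID: u32 = 1;
const EOS_ID: u32 = 2;

/// Wrap already-encoded text tokens with optional BOS/EOS markers.
/// This mirrors the assumed meaning of the `add_bos` / `add_eos` flags.
fn wrap_with_special_tokens(body: &[u32], add_bos: bool, add_eos: bool) -> Vec<u32> {
    let mut out = Vec::with_capacity(body.len() + 2);
    if add_bos {
        out.push(BOS_ID); // marker goes before the text tokens
    }
    out.extend_from_slice(body);
    if add_eos {
        out.push(EOS_ID); // marker goes after the text tokens
    }
    out
}
```

Under these assumptions, wrap_with_special_tokens(&[10, 11], true, false) yields the BOS id followed by the two text tokens. Decoding with a policy such as SpecialTokenPolicy::Keep would then be expected to render the markers in the output text rather than drop them.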
use tekken::audio::{Audio, AudioConfig, AudioSpectrogramConfig, AudioEncoder};
fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Load audio
    let audio = Audio::from_file("audio.wav")?;

    // Create audio configuration
    let spectrogram_config = AudioSpectrogramConfig::new(80, 160, 400)?;
    let audio_config = AudioConfig::new(16000, 12.5, spectrogram_config, None)?;

    // Encode audio to tokens
    let encoder = AudioEncoder::new(audio_config, 1000, 1001); // audio_token_id, begin_audio_token_id
    let encoding = encoder.encode(audio)?;

    println!("Audio encoded to {} tokens", encoding.tokens.len());
    Ok(())
}
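To get a feel for how many tokens an audio clip might produce, the configuration values above can be turned into a rough estimate. The sketch below is standalone and does not call the crate. It assumes that 16000 is the sampling rate in Hz, that 12.5 is the number of audio tokens per second, and that the last partial frame is rounded up. The function name is hypothetical, and the crate's actual count may differ (for example through padding or an extra begin-audio token).

```rust
/// Back-of-the-envelope estimate of the number of audio tokens for a clip.
/// Assumes one token covers `sampling_rate / frame_rate` samples and that a
/// trailing partial frame still costs one token.
fn estimate_audio_tokens(num_samples: usize, sampling_rate: u32, frame_rate: f64) -> usize {
    // With 16000 Hz and 12.5 tokens per second this is 1280 samples per token.
    let samples_per_token = f64::from(sampling_rate) / frame_rate;
    (num_samples as f64 / samples_per_token).ceil() as usize
}
```

Under these assumptions a 3-second clip at 16 kHz (48,000 samples) spans 37.5 frames, which rounds up to 38 tokens.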
Run the examples to see the tokenizer in action:
# Basic tokenizer test
cargo run --example basic_tokenizer_test
# Audio processing test
cargo run --bin test_audio
Run the test suite:
cargo test
The tokenizer consists of several key components:
tokenizer.rs: Main tokenizer implementation
audio.rs: Audio processing and encoding functionality
special_tokens.rs: Special token definitions and handling
config.rs: Configuration structures
errors.rs: Error handling
The audio implementation includes:
This Rust implementation is designed to be fully compatible with the Python version.
tekken-rs/
├── src/
│ ├── lib.rs # Library entry point
│ ├── tokenizer.rs # Main tokenizer implementation
│ ├── audio.rs # Audio processing functionality
│ ├── special_tokens.rs # Special token definitions
│ ├── config.rs # Configuration structures
│ └── errors.rs # Error types
├── examples/ # Example usage
├── tests/ # Integration tests
└── benches/ # Performance benchmarks
The Rust implementation provides significant performance improvements over the Python version.
Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
Please make sure to:
Run cargo fmt and cargo clippy before submitting.
See CONTRIBUTING.md for detailed guidelines.
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
This is an original Rust implementation designed to be compatible with Mistral AI's Tekken tokenizer format.
See NOTICE file for detailed attribution.