| Crates.io | wetext-rs |
| lib.rs | wetext-rs |
| version | 0.1.2 |
| created_at | 2025-12-30 09:46:27.699231+00 |
| updated_at | 2025-12-30 09:46:27.699231+00 |
| description | Text normalization library for TTS, Rust implementation of WeText |
| homepage | |
| repository | https://github.com/SpenserCai/wetext-rs |
| max_upload_size | |
| id | 2012375 |
| size | 99,724 |
A Rust implementation of WeText for text normalization in TTS (Text-to-Speech) applications.
This project is a Rust port of the Python wetext library, which provides a lightweight runtime for WeTextProcessing without depending on Pynini. The primary motivation for this Rust implementation is to offer the same functionality as a pure-Rust library, with no Python or Pynini runtime dependency.
The original Python implementation uses kaldifst for FST operations. This Rust version uses rustfst, a pure Rust implementation of OpenFST, to achieve the same functionality.
"2024年1月15日" → "二零二四年一月十五日""$100" → "one hundred dollars""一百二十三" → "123""don't" → "do not"Add to your Cargo.toml:
[dependencies]
wetext-rs = "0.1"
This library requires FST (Finite State Transducer) weight files for text normalization. The weight files can be downloaded from:
ModelScope: pengzhendong/wetext
Download the weight files and organize them in the following structure:
fsts/
├── traditional_to_simple.fst
├── full_to_half.fst
├── remove_interjections.fst
├── remove_puncts.fst
├── tag_oov.fst
├── en/
│   └── tn/
│       ├── tagger.fst
│       └── verbalizer.fst
├── zh/
│   ├── tn/
│   │   ├── tagger.fst
│   │   ├── verbalizer.fst
│   │   └── verbalizer_remove_erhua.fst
│   └── itn/
│       ├── tagger.fst
│       ├── tagger_enable_0_to_9.fst
│       └── verbalizer.fst
└── ja/
    ├── tn/
    │   ├── tagger.fst
    │   └── verbalizer.fst
    └── itn/
        ├── tagger.fst
        ├── tagger_enable_0_to_9.fst
        └── verbalizer.fst
Option 1: ModelScope CLI
pip install modelscope
modelscope download --model pengzhendong/wetext --local_dir ./fsts
Option 2: Git LFS
git lfs install
git clone https://www.modelscope.cn/pengzhendong/wetext.git fsts
use wetext_rs::{Normalizer, NormalizerConfig, Language, Operator};
// Create normalizer with default settings (Chinese TN, auto language detection)
let mut normalizer = Normalizer::with_defaults("path/to/fsts");
// Normalize text
let result = normalizer.normalize("2024年1月15日").unwrap();
println!("{}", result); // 二零二四年一月十五日
use wetext_rs::{Normalizer, NormalizerConfig, Language, Operator};
// Configure for specific language and operation
let config = NormalizerConfig::new()
.with_lang(Language::Zh)
.with_operator(Operator::Tn)
.with_fix_contractions(true)
.with_traditional_to_simple(true);
let mut normalizer = Normalizer::new("path/to/fsts", config);
let result = normalizer.normalize("100元").unwrap();
println!("{}", result); // 一百元
use wetext_rs::{Normalizer, NormalizerConfig, Language, Operator};
let config = NormalizerConfig::new()
.with_lang(Language::Zh)
.with_operator(Operator::Itn);
let mut normalizer = Normalizer::new("path/to/fsts", config);
let result = normalizer.normalize("一百二十三").unwrap();
println!("{}", result); // 123
use wetext_rs::normalize;
let result = normalize("path/to/fsts", "123").unwrap();
println!("{}", result); // 幺二三
| Option | Default | Description |
|---|---|---|
| lang | Auto | Language: Auto, En, Zh, Ja |
| operator | Tn | Operation: Tn (text normalization), Itn (inverse text normalization) |
| fix_contractions | false | Expand English contractions |
| traditional_to_simple | false | Convert Traditional to Simplified Chinese |
| full_to_half | false | Convert full-width to half-width characters |
| remove_interjections | false | Remove interjections (e.g., "嗯", "啊") |
| remove_puncts | false | Remove punctuation marks |
| tag_oov | false | Tag out-of-vocabulary words |
| enable_0_to_9 | false | Enable 0-9 digit conversion in ITN |
| remove_erhua | false | Remove erhua (儿化音) |
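The boolean options compose through the same builder pattern shown in the usage examples. Below is a minimal sketch combining several of them; only with_lang, with_operator, with_fix_contractions, and with_traditional_to_simple appear in the examples above, so the remaining with_* names are assumptions that simply mirror the option names in the table.
use wetext_rs::{Normalizer, NormalizerConfig, Language, Operator};
// Combine several options from the table above.
let config = NormalizerConfig::new()
    .with_lang(Language::Zh)
    .with_operator(Operator::Tn)
    .with_traditional_to_simple(true)
    .with_full_to_half(true)          // assumed builder name, mirroring the option
    .with_remove_interjections(true)  // assumed builder name
    .with_remove_puncts(true);        // assumed builder name
let mut normalizer = Normalizer::new("path/to/fsts", config);
let result = normalizer.normalize("嗯，這裡有１００元！").unwrap();
println!("{}", result);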
Chinese TN:
| Input | Output |
|---|---|
| 123 | 幺二三 |
| 2024年 | 二零二四年 |
| 2024年1月15日 | 二零二四年一月十五日 |
| 下午3点30分 | 下午三点三十分 |
| 100元 | 一百元 |
| 3/4 | 四分之三 |
| 1.5 | 一点五 |
Chinese ITN:
| Input | Output |
|---|---|
| 一百二十三 | 123 |
| 二零二四年 | 2024年 |
| 一点五 | 1.5 |
English TN:
| Input | Output |
|---|---|
| $100 | one hundred dollars |
| January 15, 2024 | january fifteenth twenty twenty four |
| 3.14 | three point one four |
Japanese TN:
| Input | Output |
|---|---|
| 100円 | 百円 |
| 2024年 | 二千二十四年 |
| 3月15日 | 三月十五日 |
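English and Japanese normalization go through the same API as the Chinese examples; below is a minimal sketch that selects the language explicitly instead of relying on Auto detection (the builder methods and expected outputs are the ones shown above).
use wetext_rs::{Normalizer, NormalizerConfig, Language, Operator};
// English TN: select the language explicitly rather than using Language::Auto.
let config = NormalizerConfig::new()
    .with_lang(Language::En)
    .with_operator(Operator::Tn);
let mut normalizer = Normalizer::new("path/to/fsts", config);
let result = normalizer.normalize("$100").unwrap();
println!("{}", result); // one hundred dollars
// Japanese TN works the same way with Language::Ja.
let config = NormalizerConfig::new()
    .with_lang(Language::Ja)
    .with_operator(Operator::Tn);
let mut normalizer = Normalizer::new("path/to/fsts", config);
let result = normalizer.normalize("100円").unwrap();
println!("{}", result); // 百円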
| Crate | Purpose |
|---|---|
| rustfst | FST operations (Rust implementation of OpenFST) |
| thiserror | Error handling |
| regex | Regular expressions |
| once_cell | Lazy initialization |
| serde_json | JSON parsing |
This Rust implementation is designed to be compatible with the Python wetext library. The core TN/ITN functionality produces identical results for the same inputs.
Differences from Python version:
| Aspect | Python wetext | Rust wetext-rs |
|---|---|---|
| Language detection | Chinese/English only | Adds Japanese detection (via Hiragana/Katakana) |
| Contractions | Runtime loaded | Compile-time embedded |
| Error handling | Python exceptions | Result<T, WeTextError> |
| FST library | kaldifst | rustfst |
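Because fallible calls return Result<T, WeTextError>, errors can be handled explicitly instead of unwrapped. A minimal sketch follows; the concrete error variants are not documented here, so only the Display formatting is used.
use wetext_rs::{Normalizer, NormalizerConfig};
let config = NormalizerConfig::new();
let mut normalizer = Normalizer::new("path/to/fsts", config);
// Match on the Result instead of calling unwrap(). thiserror is a listed
// dependency, so WeTextError presumably implements std::error::Error and
// can also be propagated with the ? operator.
match normalizer.normalize("2024年1月15日") {
    Ok(text) => println!("{}", text),
    Err(err) => eprintln!("normalization failed: {}", err),
}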
# Run all unit and integration tests
cargo test
# Run with verbose output
cargo test -- --nocapture
To verify that the Rust implementation produces identical results to the Python version:
cd tests
python3.13 -m venv venv
source venv/bin/activate
pip install wetext
python generate_reference.py
This creates tests/reference_outputs.json with expected outputs from Python wetext.
cargo test test_compare_with_python -- --ignored --nocapture
Expected output:
✓ PASS: '123' (zh/tn) => '幺二三'
✓ PASS: '2024年1月15日' (zh/tn) => '二零二四年一月十五日'
...
Results: 20 passed, 0 failed
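For illustration only, here is a rough sketch of what such a comparison test could look like. It assumes tests/reference_outputs.json is a JSON array of objects with input, lang, operator, and expected fields; the real schema is whatever tests/generate_reference.py writes, and the actual test in the repository may differ.
use wetext_rs::{Normalizer, NormalizerConfig, Language, Operator};
#[test]
#[ignore] // run with: cargo test test_compare_with_python -- --ignored
fn compare_with_python_sketch() {
    // Assumes serde_json is available to the test target (e.g. under [dev-dependencies]).
    let data = std::fs::read_to_string("tests/reference_outputs.json").unwrap();
    let cases: serde_json::Value = serde_json::from_str(&data).unwrap();
    let config = NormalizerConfig::new()
        .with_lang(Language::Zh)
        .with_operator(Operator::Tn);
    let mut normalizer = Normalizer::new("fsts", config); // path to the downloaded weights
    for case in cases.as_array().unwrap() {
        // This sketch only checks zh/tn entries; a full test would dispatch on
        // the lang and operator recorded for each case.
        if case["lang"] != "zh" || case["operator"] != "tn" {
            continue;
        }
        let input = case["input"].as_str().unwrap();
        let expected = case["expected"].as_str().unwrap();
        let got = normalizer.normalize(input).unwrap();
        assert_eq!(got, expected, "mismatch for '{}'", input);
    }
}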
# Run clippy linter
cargo clippy -- -D warnings
# Format code
cargo fmt
# Check formatting
cargo fmt -- --check
This project is licensed under the Apache-2.0 License.