wetext-rs

Crates.io: wetext-rs
lib.rs: wetext-rs
version: 0.1.2
created_at: 2025-12-30 09:46:27.699231+00
updated_at: 2025-12-30 09:46:27.699231+00
description: Text normalization library for TTS, Rust implementation of WeText
repository: https://github.com/SpenserCai/wetext-rs
id: 2012375
size: 99,724
owner: SpenserCai


A Rust implementation of WeText for text normalization in TTS (Text-to-Speech) applications.




Background

This project is a Rust port of the Python wetext library, which provides a lightweight runtime for WeTextProcessing without depending on Pynini. The primary motivations for this Rust implementation are to:

  1. Integrate with the Candle ecosystem: enable seamless integration with Rust-based ML frameworks like Candle, eliminating Python dependencies in production deployments
  2. Improve performance: use Rust's zero-cost abstractions for faster text processing while keeping memory safety
  3. Enable standalone deployment: create a single binary that can be deployed without a Python runtime

The original Python implementation uses kaldifst for FST operations. This Rust version uses rustfst, a pure Rust implementation of OpenFST, to achieve the same functionality.


Features

  • Text Normalization (TN): Convert numbers, dates, currency to spoken form
    • "2024年1月15日" → "二零二四年一月十五日"
    • "$100" → "one hundred dollars"
  • Inverse Text Normalization (ITN): Convert spoken form back to written form
    • "一百二十三" → "123"
  • Multi-language support: Chinese (zh), English (en), Japanese (ja)
  • English contractions expansion: "don't" → "do not"
  • Various text preprocessing options:
    • Traditional to Simplified Chinese conversion
    • Full-width to half-width character conversion
    • Interjection removal
    • Punctuation removal
    • Erhua (儿化音) removal
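
The contraction expansion listed above can be pictured as a table lookup. The following is a minimal sketch under that assumption, not the crate's implementation: `CONTRACTIONS` and `expand_contractions` are hypothetical names, the table is a tiny hand-written sample, and the real library embeds its own list.

```rust
/// Hypothetical sample table; the crate ships its own, larger list.
const CONTRACTIONS: &[(&str, &str)] = &[
    ("don't", "do not"),
    ("can't", "cannot"),
    ("won't", "will not"),
];

/// Expand contractions word by word.
/// Words are split on whitespace, matched case-insensitively against the
/// table, and re-joined with single spaces. Expansions are emitted in
/// lowercase; words without a table entry are kept unchanged.
pub fn expand_contractions(text: &str) -> String {
    text.split_whitespace()
        .map(|word| {
            let lower = word.to_lowercase();
            CONTRACTIONS
                .iter()
                .find(|pair| pair.0 == lower)
                .map(|pair| pair.1.to_string())
                .unwrap_or_else(|| word.to_string())
        })
        .collect::<Vec<_>>()
        .join(" ")
}
```

Calling `expand_contractions("I don't know")` yields `"I do not know"`. A production version would also need to handle punctuation attached to words and typographic apostrophes, which this sketch ignores.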

Installation

Add to your Cargo.toml:

[dependencies]
wetext-rs = "0.1"

FST Weight Files

This library requires FST (Finite State Transducer) weight files for text normalization. The weight files can be downloaded from:

ModelScope: pengzhendong/wetext

Download the weight files and organize them in the following structure:

fsts/
├── traditional_to_simple.fst
├── full_to_half.fst
├── remove_interjections.fst
├── remove_puncts.fst
├── tag_oov.fst
├── en/
│   └── tn/
│       ├── tagger.fst
│       └── verbalizer.fst
├── zh/
│   ├── tn/
│   │   ├── tagger.fst
│   │   ├── verbalizer.fst
│   │   └── verbalizer_remove_erhua.fst
│   └── itn/
│       ├── tagger.fst
│       ├── tagger_enable_0_to_9.fst
│       └── verbalizer.fst
└── ja/
    ├── tn/
    │   ├── tagger.fst
    │   └── verbalizer.fst
    └── itn/
        ├── tagger.fst
        ├── tagger_enable_0_to_9.fst
        └── verbalizer.fst
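
Given this layout, picking the tagger and verbalizer files for a language and operator comes down to joining path components. The sketch below is illustrative only: `fst_paths` is a hypothetical helper, and the mapping from the `enable_0_to_9` and `remove_erhua` options to `tagger_enable_0_to_9.fst` and `verbalizer_remove_erhua.fst` is an assumption inferred from the file names, not confirmed behavior of the crate.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical helper: returns the (tagger, verbalizer) paths under `root`.
///
/// `lang` is "zh", "en" or "ja"; `operator` is "tn" or "itn".
/// Note that the layout above contains no `en/itn` directory, so that
/// combination would point at files that do not exist.
pub fn fst_paths(
    root: &Path,
    lang: &str,
    operator: &str,
    enable_0_to_9: bool,
    remove_erhua: bool,
) -> (PathBuf, PathBuf) {
    let dir = root.join(lang).join(operator);

    // Assumption: the alternative tagger only exists for ITN.
    let tagger = if operator == "itn" && enable_0_to_9 {
        "tagger_enable_0_to_9.fst"
    } else {
        "tagger.fst"
    };

    // Assumption: the erhua-removing verbalizer only exists for Chinese TN.
    let verbalizer = if lang == "zh" && operator == "tn" && remove_erhua {
        "verbalizer_remove_erhua.fst"
    } else {
        "verbalizer.fst"
    };

    (dir.join(tagger), dir.join(verbalizer))
}
```

For example, `fst_paths(Path::new("fsts"), "zh", "itn", true, false)` resolves to `fsts/zh/itn/tagger_enable_0_to_9.fst` and `fsts/zh/itn/verbalizer.fst`. Checking both paths with `Path::exists` before loading is a cheap way to catch an incomplete download.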

Download Options

Option 1: ModelScope CLI

pip install modelscope
modelscope download --model pengzhendong/wetext --local_dir ./fsts

Option 2: Git LFS

git lfs install
git clone https://www.modelscope.cn/pengzhendong/wetext.git fsts

Usage

Basic Usage

use wetext_rs::Normalizer;

// Create a normalizer with default settings (TN operator, automatic language detection)
let mut normalizer = Normalizer::with_defaults("path/to/fsts");

// Normalize text
let result = normalizer.normalize("2024年1月15日").unwrap();
println!("{}", result);  // 二零二四年一月十五日

With Configuration

use wetext_rs::{Normalizer, NormalizerConfig, Language, Operator};

// Configure for specific language and operation
let config = NormalizerConfig::new()
    .with_lang(Language::Zh)
    .with_operator(Operator::Tn)
    .with_fix_contractions(true)
    .with_traditional_to_simple(true);

let mut normalizer = Normalizer::new("path/to/fsts", config);
let result = normalizer.normalize("100元").unwrap();
println!("{}", result);  // 一百元

Inverse Text Normalization (ITN)

use wetext_rs::{Normalizer, NormalizerConfig, Language, Operator};

let config = NormalizerConfig::new()
    .with_lang(Language::Zh)
    .with_operator(Operator::Itn);

let mut normalizer = Normalizer::new("path/to/fsts", config);
let result = normalizer.normalize("一百二十三").unwrap();
println!("{}", result);  // 123

Convenience Function

use wetext_rs::normalize;

let result = normalize("path/to/fsts", "123").unwrap();
println!("{}", result);  // 幺二三

Configuration Options

| Option | Default | Description |
|--------|---------|-------------|
| lang | Auto | Language: Auto, En, Zh, Ja |
| operator | Tn | Operation: Tn (text normalization), Itn (inverse) |
| fix_contractions | false | Expand English contractions |
| traditional_to_simple | false | Convert Traditional to Simplified Chinese |
| full_to_half | false | Convert full-width to half-width characters |
| remove_interjections | false | Remove interjections (e.g., "嗯", "啊") |
| remove_puncts | false | Remove punctuation marks |
| tag_oov | false | Tag out-of-vocabulary words |
| enable_0_to_9 | false | Enable 0-9 digit conversion in ITN |
| remove_erhua | false | Remove erhua (儿化音) |
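
To give a feel for what the `full_to_half` option does, here is a std-only sketch. The crate presumably performs this step with the `full_to_half.fst` transducer; the code below instead relies on the Unicode layout, where the full-width forms U+FF01 to U+FF5E sit at a fixed offset of 0xFEE0 from ASCII U+0021 to U+007E. The function names are hypothetical, and the exact set of characters the FST converts may differ.

```rust
/// Map one full-width character to its half-width counterpart, if it has one.
fn to_half_width(c: char) -> char {
    match c {
        // Ideographic space becomes an ASCII space.
        '\u{3000}' => ' ',
        // Full-width ASCII variants mirror U+0021..=U+007E at offset 0xFEE0.
        '\u{FF01}'..='\u{FF5E}' => char::from_u32(c as u32 - 0xFEE0).unwrap_or(c),
        // Everything else (including CJK ideographs) is left untouched.
        _ => c,
    }
}

/// Convert every full-width character in `text` to half-width.
pub fn full_to_half(text: &str) -> String {
    text.chars().map(to_half_width).collect()
}
```

With this mapping, full-width letters, digits, and punctuation such as `ＡＢＣ１２３！` come out as `ABC123!`, while Chinese characters pass through unchanged.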

Examples

Chinese Text Normalization

| Input | Output |
|-------|--------|
| 123 | 幺二三 |
| 2024年 | 二零二四年 |
| 2024年1月15日 | 二零二四年一月十五日 |
| 下午3点30分 | 下午三点三十分 |
| 100元 | 一百元 |
| 3/4 | 四分之三 |
| 1.5 | 一点五 |

Chinese Inverse Text Normalization

| Input | Output |
|-------|--------|
| 一百二十三 | 123 |
| 二零二四年 | 2024年 |
| 一点五 | 1.5 |
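
The integer cases in this table can be reproduced by hand to show what the ITN transducers have to encode. The sketch below is a plain Rust parser, not the FST-based method the library uses; `parse_chinese_number` and its helpers are hypothetical names. It covers integers only (no decimals such as 一点五, no 亿) and does not validate malformed numerals.

```rust
/// Value of a single Chinese digit character, if it is one.
fn digit_value(c: char) -> Option<u64> {
    match c {
        '零' | '〇' => Some(0),
        '一' | '幺' => Some(1),
        '二' | '两' => Some(2),
        '三' => Some(3),
        '四' => Some(4),
        '五' => Some(5),
        '六' => Some(6),
        '七' => Some(7),
        '八' => Some(8),
        '九' => Some(9),
        _ => None,
    }
}

/// Multiplier of a unit character inside one group below 10,000.
fn unit_value(c: char) -> Option<u64> {
    match c {
        '十' => Some(10),
        '百' => Some(100),
        '千' => Some(1000),
        _ => None,
    }
}

/// Parse a Chinese integer numeral. Returns `None` for empty input or
/// for any character outside the supported set.
pub fn parse_chinese_number(text: &str) -> Option<u64> {
    if text.is_empty() {
        return None;
    }
    // Digit-by-digit reading such as "二零二四": no unit characters at all.
    if text.chars().all(|c| digit_value(c).is_some()) {
        return text
            .chars()
            .try_fold(0u64, |acc, c| acc.checked_mul(10)?.checked_add(digit_value(c)?));
    }
    let mut total = 0u64; // finished 万 groups
    let mut section = 0u64; // current group below 10,000
    let mut digit = 0u64; // digit waiting for its unit
    for c in text.chars() {
        if let Some(d) = digit_value(c) {
            digit = d;
        } else if let Some(u) = unit_value(c) {
            // A bare unit, like the leading "十" in "十五", counts as one.
            section += digit.max(1) * u;
            digit = 0;
        } else if c == '万' {
            total += (section + digit).max(1) * 10_000;
            section = 0;
            digit = 0;
        } else {
            return None; // unsupported character
        }
    }
    Some(total + section + digit)
}
```

Here `parse_chinese_number("一百二十三")` returns `Some(123)` and `parse_chinese_number("二零二四")` returns `Some(2024)`. Suffixes such as 年 are outside the supported set and would have to be split off before calling the function.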

English Text Normalization

| Input | Output |
|-------|--------|
| $100 | one hundred dollars |
| January 15, 2024 | january fifteenth twenty twenty four |
| 3.14 | three point one four |

Japanese Text Normalization

| Input | Output |
|-------|--------|
| 100円 | 百円 |
| 2024年 | 二千二十四年 |
| 3月15日 | 三月十五日 |

Dependencies

| Crate | Purpose |
|-------|---------|
| rustfst | FST operations (Rust implementation of OpenFST) |
| thiserror | Error handling |
| regex | Regular expressions |
| once_cell | Lazy initialization |
| serde_json | JSON parsing |

Compatibility with Python WeText

This Rust implementation is designed to be compatible with the Python wetext library. The core TN/ITN functionality produces identical results for the same inputs.

Differences from Python version:

| Aspect | Python wetext | Rust wetext-rs |
|--------|---------------|----------------|
| Language detection | Chinese/English only | Adds Japanese detection (via Hiragana/Katakana) |
| Contractions | Runtime loaded | Compile-time embedded |
| Error handling | Python exceptions | Result<T, WeTextError> |
| FST library | kaldifst | rustfst |
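
Script-based detection of the kind mentioned in the table can be sketched with Unicode block ranges. This is an assumption about how such detection might work, not the crate's actual logic: `DetectedLang` and `detect_language` are hypothetical names, and falling back to English when no CJK character is present is a choice made for this example.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DetectedLang {
    En,
    Zh,
    Ja,
}

/// Guess the language of `text` from the scripts it contains.
pub fn detect_language(text: &str) -> DetectedLang {
    let mut has_han = false;
    for c in text.chars() {
        match c {
            // Hiragana (U+3040..=U+309F) and Katakana (U+30A0..=U+30FF).
            // Kana wins immediately, because Japanese text mixes kana
            // with the same ideographs that Chinese uses.
            '\u{3040}'..='\u{30FF}' => return DetectedLang::Ja,
            // CJK Unified Ideographs.
            '\u{4E00}'..='\u{9FFF}' => has_han = true,
            _ => {}
        }
    }
    if has_han {
        DetectedLang::Zh
    } else {
        DetectedLang::En
    }
}
```

Under this scheme `"2024年1月15日"` is classified as Chinese and `"これは100円です"` as Japanese. Japanese input written only in kanji and digits, such as `"100円"`, contains no kana and would be classified as Chinese, which is a known limit of kana-based detection and a reason to set the language explicitly in that case.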

Development

Running Tests

# Run all unit and integration tests
cargo test

# Run with verbose output
cargo test -- --nocapture

Consistency Testing with Python WeText

To verify that the Rust implementation produces the same results as the Python version:

  1. Set up a Python environment (Python 3.13 recommended):
cd tests
python3.13 -m venv venv
source venv/bin/activate
pip install wetext
  2. Generate reference outputs from Python:
python tests/generate_reference.py

This creates tests/reference_outputs.json with expected outputs from Python wetext.

  3. Run comparison tests:
cargo test test_compare_with_python -- --ignored --nocapture

Expected output:

✓ PASS: '123' (zh/tn) => '幺二三'
✓ PASS: '2024年1月15日' (zh/tn) => '二零二四年一月十五日'
...
Results: 20 passed, 0 failed
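
The comparison itself amounts to checking each Rust output against the stored Python output and tallying the results. The sketch below shows that loop in isolation, under stated assumptions: `Case` and `compare` are hypothetical names, the layout of `reference_outputs.json` is not modeled, and the `FAIL` line format is invented for illustration (only the `PASS` and `Results` formats appear in the output above).

```rust
/// One comparison case. `expected` would come from the Python reference
/// file and `actual` from the Rust normalizer.
pub struct Case<'a> {
    pub input: &'a str,
    pub lang: &'a str,
    pub operator: &'a str,
    pub expected: &'a str,
    pub actual: &'a str,
}

/// Build the report lines and return them with the pass and fail counts.
pub fn compare(cases: &[Case<'_>]) -> (Vec<String>, usize, usize) {
    let mut lines = Vec::new();
    let (mut passed, mut failed) = (0, 0);
    for c in cases {
        if c.actual == c.expected {
            passed += 1;
            lines.push(format!(
                "✓ PASS: '{}' ({}/{}) => '{}'",
                c.input, c.lang, c.operator, c.actual
            ));
        } else {
            failed += 1;
            // Hypothetical format for mismatches.
            lines.push(format!(
                "✗ FAIL: '{}' ({}/{}) => '{}' (expected '{}')",
                c.input, c.lang, c.operator, c.actual, c.expected
            ));
        }
    }
    lines.push(format!("Results: {} passed, {} failed", passed, failed));
    (lines, passed, failed)
}
```

Returning the counts alongside the lines lets a test assert `failed == 0` while still printing the full report under `--nocapture`.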

Code Quality

# Run clippy linter
cargo clippy -- -D warnings

# Format code
cargo fmt

# Check formatting
cargo fmt -- --check



License

This project is licensed under the Apache-2.0 License.
