| Crates.io | voirs-recognizer |
|---|---|
| lib.rs | voirs-recognizer |
| version | 0.1.0-alpha.1 |
| created_at | 2025-09-21 04:43:30.25396+00 |
| updated_at | 2025-09-21 04:43:30.25396+00 |
| description | Voice recognition and analysis capabilities for VoiRS |
| homepage | https://github.com/cool-japan/voirs |
| repository | https://github.com/cool-japan/voirs |
| max_upload_size | |
| id | 1848450 |
| size | 3,845,174 |
Automatic Speech Recognition (ASR) and phoneme alignment for VoiRS
VoiRS Recognizer provides comprehensive speech recognition capabilities for the VoiRS ecosystem, enabling accurate transcription, phoneme alignment, and audio analysis for speech synthesis evaluation and training.
Add VoiRS Recognizer to your Cargo.toml:
[dependencies]
voirs-recognizer = "0.1.0"
# Enable specific ASR models
voirs-recognizer = { version = "0.1.0", features = ["whisper", "forced-align"] }
use voirs_recognizer::prelude::*;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Initialize ASR system with Whisper
    let asr = WhisperASR::new().await?;

    // Load audio file
    let audio = AudioBuffer::from_file("speech.wav")?;

    // Recognize speech
    let transcript = asr.recognize(&audio, None).await?;
    println!("Transcript: {}", transcript.text);

    // Get word-level timestamps
    for word in &transcript.words {
        println!("{}: {:.2}s - {:.2}s", word.word, word.start, word.end);
    }

    Ok(())
}
use voirs_recognizer::prelude::*;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Initialize phoneme recognizer
    let recognizer = MFARecognizer::new().await?;

    // Align audio with text
    let audio = AudioBuffer::from_file("speech.wav")?;
    let text = "Hello world, this is a test.";
    let alignment = recognizer.align_phonemes(&audio, text, None).await?;

    // Print phoneme-level alignment
    for phoneme in &alignment.phonemes {
        println!("{}: {:.3}s - {:.3}s (confidence: {:.2})",
            phoneme.phoneme.symbol,
            phoneme.start_time,
            phoneme.end_time,
            phoneme.confidence);
    }

    Ok(())
}
use voirs_recognizer::prelude::*;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Initialize audio analyzer
    let analyzer = AudioAnalyzer::new().await?;

    // Analyze audio quality
    let audio = AudioBuffer::from_file("speech.wav")?;
    let analysis = analyzer.analyze_quality(&audio, None).await?;

    println!("SNR: {:.1} dB", analysis.snr);
    println!("THD: {:.2}%", analysis.thd);
    println!("Spectral centroid: {:.1} Hz", analysis.spectral_centroid);

    // Check for artifacts
    if analysis.clipping_detected {
        println!("⚠️ Audio clipping detected");
    }

    Ok(())
}
Enable specific functionality through feature flags:
[dependencies.voirs-recognizer]
version = "0.1.0-alpha.1"
features = [
    "whisper",      # OpenAI Whisper support
    "deepspeech",   # Mozilla DeepSpeech support
    "wav2vec2",     # Facebook Wav2Vec2 support
    "forced-align", # Basic forced alignment
    "mfa",          # Montreal Forced Aligner
    "all-models",   # Enable all ASR models
    "gpu",          # GPU acceleration support
]
VoiRS Recognizer is designed to meet strict performance requirements. The default targets, which the validator below checks against, are:

- Real-time factor (RTF) ≤ 0.3
- Memory usage ≤ 2 GB
- Startup time ≤ 5 seconds
- Streaming latency ≤ 200 ms
Use the built-in performance validator to ensure your configuration meets requirements:
use voirs_recognizer::prelude::*;
use std::time::Duration;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let validator = PerformanceValidator::new().with_verbose(true);

    // Set custom requirements if needed
    let requirements = PerformanceRequirements {
        max_rtf: 0.25,                   // Stricter than the default 0.3
        max_memory_usage: 1_500_000_000, // 1.5 GB instead of 2 GB
        max_startup_time_ms: 3000,       // 3 seconds instead of 5
        max_streaming_latency_ms: 150,   // 150 ms instead of 200 ms
    };
    let custom_validator = PerformanceValidator::with_requirements(requirements);

    // Validate your ASR system
    // (swap in `custom_validator` below to check against the stricter limits)
    let audio = AudioBuffer::from_file("test_audio.wav")?;
    let startup_fn = || async {
        let _asr = WhisperASR::new().await?;
        Ok::<(), Box<dyn std::error::Error>>(())
    };

    let processing_time = Duration::from_millis(150); // Your actual processing time
    let streaming_latency = Some(Duration::from_millis(120));

    let validation = validator
        .validate_comprehensive(&audio, startup_fn, processing_time, streaming_latency)
        .await?;

    if validation.passed {
        println!("✅ All performance requirements met!");
        println!("RTF: {:.3}", validation.metrics.rtf);
        println!("Memory: {:.1} MB", validation.metrics.memory_usage as f64 / (1024.0 * 1024.0));
        println!("Throughput: {:.0} samples/sec", validation.metrics.throughput_samples_per_sec);
    } else {
        println!("❌ Performance requirements not met");
        for (test, passed) in &validation.test_results {
            println!("{}: {}", test, if *passed { "PASS" } else { "FAIL" });
        }
    }

    Ok(())
}
Choose the appropriate model size based on your performance requirements:
use voirs_recognizer::prelude::*;

// Ultra-fast processing (RTF ~0.1, lower accuracy)
let fast_config = WhisperConfig {
    model_size: "tiny".to_string(),
    compute_type: ComputeType::Int8, // Quantized for speed
    beam_size: 1,                    // Greedy decoding
    ..Default::default()
};

// Balanced performance and accuracy (RTF ~0.3)
let balanced_config = WhisperConfig {
    model_size: "base".to_string(),
    compute_type: ComputeType::Float16,
    beam_size: 3,
    ..Default::default()
};

// High accuracy (RTF ~0.8, higher latency)
let accurate_config = WhisperConfig {
    model_size: "small".to_string(),
    compute_type: ComputeType::Float32,
    beam_size: 5,
    temperature: 0.0, // Deterministic output
    ..Default::default()
};
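To run with one of these presets, pass the config when constructing the recognizer. A minimal sketch, assuming a `WhisperASR::with_config` constructor by analogy with the `WhisperASR::with_threading_config` constructor shown later in this README:

```rust
// Sketch only: `WhisperASR::with_config` is assumed by analogy with the
// `with_threading_config` constructor used in the CPU optimization section below.
let asr = WhisperASR::with_config(balanced_config).await?;
let audio = AudioBuffer::from_file("speech.wav")?;
let transcript = asr.recognize(&audio, None).await?;
println!("{}", transcript.text);
```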
Enable GPU acceleration for significant performance gains:
use voirs_recognizer::prelude::*;

// NVIDIA GPU acceleration
let cuda_config = WhisperConfig {
    model_size: "base".to_string(),
    device: Device::Cuda(0),            // Use first GPU
    compute_type: ComputeType::Float16, // FP16 for memory efficiency
    memory_fraction: 0.8,               // Use 80% of GPU memory
    ..Default::default()
};

// Apple Silicon GPU acceleration
let metal_config = WhisperConfig {
    model_size: "base".to_string(),
    device: Device::Metal,
    compute_type: ComputeType::Float16,
    ..Default::default()
};

// CPU with optimizations
let optimized_cpu_config = WhisperConfig {
    model_size: "tiny".to_string(),
    device: Device::Cpu,
    num_threads: num_cpus::get(),    // Use all CPU cores (requires the `num_cpus` crate)
    compute_type: ComputeType::Int8, // Quantization for speed
    ..Default::default()
};
Reduce memory usage with these techniques:
use voirs_recognizer::prelude::*;

// Memory-efficient configuration
let memory_config = WhisperConfig {
    model_size: "tiny".to_string(),  // Smallest model
    compute_type: ComputeType::Int8, // 8-bit quantization
    enable_memory_pooling: true,     // Reuse memory allocations
    cache_size_mb: 100,              // Limit cache size
    ..Default::default()
};

// Enable dynamic quantization for further memory savings
let quantized_config = WhisperConfig {
    model_size: "base".to_string(),
    compute_type: ComputeType::Dynamic, // Dynamic quantization
    quantization_mode: QuantizationMode::Int8,
    ..Default::default()
};
Configure for ultra-low latency streaming:
use voirs_recognizer::prelude::*;
use voirs_recognizer::integration::config::{StreamingConfig, LatencyMode};

// Ultra-low latency configuration
let streaming_config = StreamingConfig {
    latency_mode: LatencyMode::UltraLow,
    chunk_size: 1600,         // 100ms chunks at 16kHz
    overlap: 400,             // 25ms overlap
    buffer_duration: 2.0,     // 2 second buffer
    vad_enabled: true,        // Voice activity detection
    noise_suppression: true,  // Real-time noise reduction
    echo_cancellation: false, // Disable for lowest latency
};

// Balanced latency/accuracy configuration
let balanced_streaming = StreamingConfig {
    latency_mode: LatencyMode::Balanced,
    chunk_size: 4800,     // 300ms chunks
    overlap: 800,         // 50ms overlap
    buffer_duration: 5.0, // 5 second buffer
    vad_enabled: true,
    noise_suppression: true,
    echo_cancellation: true,
};

// Create streaming ASR with the optimized config
let streaming_asr = StreamingASR::with_config(streaming_config).await?;
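Note that `chunk_size` and `overlap` are sample counts, not milliseconds. A quick check of the latency arithmetic at a 16 kHz sample rate:

```rust
// chunk_size / overlap are in samples; at 16 kHz, 16 samples = 1 ms.
const SAMPLE_RATE_HZ: f64 = 16_000.0;

let chunk_ms = 1600.0 / SAMPLE_RATE_HZ * 1000.0;  // = 100 ms per chunk
let overlap_ms = 400.0 / SAMPLE_RATE_HZ * 1000.0; // = 25 ms overlap

assert_eq!(chunk_ms, 100.0);
assert_eq!(overlap_ms, 25.0);
```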
Process multiple files efficiently:
use voirs_recognizer::prelude::*;

// Optimal batch processing
let batch_config = BatchProcessingConfig {
    batch_size: 8,                   // Process 8 files at once
    max_memory_usage: 1_500_000_000, // 1.5 GB memory limit
    parallel_workers: 4,             // Use 4 worker threads
    enable_gpu_batching: true,       // Batch GPU operations
    ..Default::default()
};

let audio_files = vec![
    "file1.wav", "file2.wav", "file3.wav", "file4.wav",
    "file5.wav", "file6.wav", "file7.wav", "file8.wav",
];

// Load audio files
let audio_buffers: Vec<AudioBuffer> = audio_files
    .iter()
    .map(|path| AudioBuffer::from_file(path))
    .collect::<Result<Vec<_>, _>>()?;

// Process the batch efficiently
let batch_processor = BatchProcessor::with_config(batch_config).await?;
let transcripts = batch_processor.process_batch(&audio_buffers).await?;

// Results are returned in the same order as the input
for (i, transcript) in transcripts.iter().enumerate() {
    println!("File {}: {}", audio_files[i], transcript.text);
}
Monitor your application's performance in real-time:
use voirs_recognizer::prelude::*;
use std::time::Instant;

// Enable performance monitoring
let monitor = PerformanceMonitor::new()
    .with_metrics_collection(true)
    .with_real_time_reporting(true);

// Monitor ASR performance (any recognizer works here)
let asr = WhisperASR::new().await?;
let audio = AudioBuffer::from_file("speech.wav")?;
let start = Instant::now();
let transcript = asr.recognize(&audio, None).await?;
let processing_time = start.elapsed();

// Validate performance
let validator = PerformanceValidator::new().with_verbose(true);
let (rtf, rtf_passed) = validator.validate_rtf(&audio, processing_time);
let (memory_usage, memory_passed) = validator.estimate_memory_usage()?;

println!("Performance Metrics:");
println!("  RTF: {:.3} ({})", rtf, if rtf_passed { "✅" } else { "❌" });
println!("  Memory: {:.1} MB ({})",
    memory_usage as f64 / (1024.0 * 1024.0),
    if memory_passed { "✅" } else { "❌" });
println!("  Processing time: {:?}", processing_time);
println!("  Audio duration: {:.2}s", audio.duration());

// Log performance metrics for analysis
monitor.log_performance_metrics(&PerformanceMetrics {
    rtf,
    memory_usage,
    startup_time_ms: 0,      // Already started
    streaming_latency_ms: 0, // Not applicable for batch
    throughput_samples_per_sec: validator.calculate_throughput(audio.samples().len(), processing_time),
    cpu_utilization: (processing_time.as_secs_f32() / audio.duration() * 100.0).min(100.0),
});
VoiRS automatically detects and uses SIMD instructions where the CPU supports them. No manual configuration is required: the optimizations are applied automatically.
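For illustration only (this is not VoiRS internals), the Rust standard library exposes the kind of runtime CPU feature checks that such detection relies on:

```rust
// Runtime SIMD feature detection on x86_64 via the standard library.
#[cfg(target_arch = "x86_64")]
fn print_simd_support() {
    println!("AVX2:    {}", is_x86_feature_detected!("avx2"));
    println!("AVX-512: {}", is_x86_feature_detected!("avx512f"));
    println!("FMA:     {}", is_x86_feature_detected!("fma"));
}
```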
Optimize thread usage for your hardware:
use voirs_recognizer::prelude::*;

// Automatic thread optimization
let thread_config = ThreadingConfig::auto_detect();

// Manual thread configuration
let manual_config = ThreadingConfig {
    num_inference_threads: num_cpus::get(), // All cores for inference
    num_preprocessing_threads: 2,           // 2 cores for preprocessing
    enable_thread_affinity: true,           // Pin threads to cores
    prefer_performance_cores: true,         // Use P-cores on hybrid CPUs
};

let asr = WhisperASR::with_threading_config(manual_config).await?;
Common performance problems and solutions:
// If memory usage exceeds limits, try:
let low_memory_config = WhisperConfig {
    model_size: "tiny".to_string(),  // Use smaller model
    compute_type: ComputeType::Int8, // Enable quantization
    cache_size_mb: 50,               // Reduce cache size
    enable_memory_pooling: true,     // Reuse allocations
    gradient_checkpointing: true,    // Trade compute for memory
    ..Default::default()
};

// If RTF is too high, try:
let fast_config = WhisperConfig {
    model_size: "tiny".to_string(),  // Smallest/fastest model
    beam_size: 1,                    // Greedy decoding
    compute_type: ComputeType::Int8, // Quantization for speed
    enable_cuda: true,               // GPU acceleration if available
    num_threads: num_cpus::get(),    // Use all CPU cores
    ..Default::default()
};

// If streaming latency is too high, try:
let low_latency_config = StreamingConfig {
    latency_mode: LatencyMode::UltraLow,
    chunk_size: 800,              // Smaller chunks (50ms at 16kHz)
    overlap: 160,                 // Minimal overlap (10ms)
    buffer_duration: 1.0,         // Smaller buffer
    preprocessing_enabled: false, // Disable preprocessing
    ..Default::default()
};
VoiRS Recognizer supports multiple languages through its ASR backends:
| Language | Whisper | DeepSpeech | Wav2Vec2 | MFA |
|---|---|---|---|---|
| English | ✅ | ✅ | ✅ | ✅ |
| Spanish | ✅ | ❌ | ❌ | ✅ |
| French | ✅ | ❌ | ❌ | ✅ |
| German | ✅ | ❌ | ❌ | ✅ |
| Japanese | ✅ | ❌ | ❌ | ❌ |
| Chinese | ✅ | ❌ | ❌ | ❌ |
| Korean | ✅ | ❌ | ❌ | ❌ |
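For example, the table shows that Japanese is covered only by the Whisper backend, so a Japanese workload would pin Whisper explicitly. A sketch using the configuration API shown below; the `LanguageCode::JaJp` variant is an assumption, by analogy with the `EnUs` variant used in the next example:

```rust
use voirs_recognizer::prelude::*;

// Pin the Whisper backend for Japanese (see the table above).
// `LanguageCode::JaJp` is assumed by analogy with `LanguageCode::EnUs`.
let config = ASRConfig {
    preferred_models: vec![ASRBackend::Whisper],
    language: Some(LanguageCode::JaJp),
    auto_detect_language: false, // the language is known in advance
    ..Default::default()
};
let asr = ASRSystem::with_config(config).await?;
```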
use voirs_recognizer::prelude::*;

let config = ASRConfig {
    // Model selection
    preferred_models: vec![ASRBackend::Whisper, ASRBackend::DeepSpeech],

    // Language settings
    language: Some(LanguageCode::EnUs),
    auto_detect_language: true,

    // Quality settings
    enable_vad: true,        // Voice Activity Detection
    noise_suppression: true, // Noise reduction

    // Performance settings
    chunk_duration_ms: 30000,  // 30 second chunks
    overlap_duration_ms: 1000, // 1 second overlap

    // Output settings
    include_word_timestamps: true,
    include_confidence_scores: true,
    normalize_text: true,

    ..Default::default()
};

let asr = ASRSystem::with_config(config).await?;
use voirs_recognizer::prelude::*;

let config = PhonemeConfig {
    // Alignment precision
    time_resolution_ms: 10,    // 10 ms resolution
    confidence_threshold: 0.5, // Minimum confidence

    // Language model
    acoustic_model: "english_us".to_string(),
    pronunciation_dict: "cmudict".to_string(),

    // Processing options
    enable_speaker_adaptation: true,
    enable_pronunciation_variants: true,

    ..Default::default()
};

let recognizer = PhonemeRecognizer::with_config(config).await?;
VoiRS Recognizer provides comprehensive error handling:
use voirs_recognizer::prelude::*;

match asr.recognize(&audio, None).await {
    Ok(transcript) => {
        println!("Success: {}", transcript.text);
    }
    Err(ASRError::ModelNotFound { model }) => {
        eprintln!("Model not available: {}", model);
    }
    Err(ASRError::AudioTooShort { duration }) => {
        eprintln!("Audio too short: {:.1}s", duration);
    }
    Err(ASRError::LanguageNotSupported { language }) => {
        eprintln!("Language not supported: {:?}", language);
    }
    Err(e) => {
        eprintln!("Recognition failed: {}", e);
    }
}
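The typed error variants also make graceful fallback straightforward. A sketch that retries with a different backend when the preferred model is missing; the `ASRBackend::Wav2Vec2` variant is assumed here, mirroring the "wav2vec2" feature flag above:

```rust
// Fall back to another backend if the preferred model is unavailable.
// `ASRBackend::Wav2Vec2` is an assumed variant, mirroring the "wav2vec2" feature.
let transcript = match asr.recognize(&audio, None).await {
    Ok(t) => t,
    Err(ASRError::ModelNotFound { model }) => {
        eprintln!("{} unavailable, retrying with Wav2Vec2", model);
        let fallback = ASRSystem::with_config(ASRConfig {
            preferred_models: vec![ASRBackend::Wav2Vec2],
            ..Default::default()
        })
        .await?;
        fallback.recognize(&audio, None).await?
    }
    Err(e) => return Err(e.into()),
};
```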
Check out the examples directory for more comprehensive usage examples:
- basic_recognition.rs - Simple speech recognition
- phoneme_alignment.rs - Detailed phoneme alignment
- audio_analysis.rs - Audio quality analysis
- batch_processing.rs - Efficient batch processing
- streaming_recognition.rs - Real-time recognition
- multilingual.rs - Multi-language support

Performance benchmarks on common datasets:
| Model | Dataset | WER | RTF | Memory |
|---|---|---|---|---|
| Whisper-base | LibriSpeech | 5.2% | 0.3x | 1.2GB |
| DeepSpeech | CommonVoice | 8.1% | 0.8x | 800MB |
| Wav2Vec2-base | LibriSpeech | 6.4% | 0.5x | 1.0GB |
RTF = Real-Time Factor (processing time / audio duration); values below 1.0 mean faster than real time.
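For example, an RTF of 0.3 means 10 seconds of audio is transcribed in 3 seconds. Computed directly:

```rust
use std::time::Duration;

// RTF = processing time / audio duration; lower is faster.
fn rtf(processing: Duration, audio_secs: f64) -> f64 {
    processing.as_secs_f64() / audio_secs
}

assert!((rtf(Duration::from_secs(3), 10.0) - 0.3).abs() < 1e-12);
```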
- GitHub Issues: For bug reports and feature requests
- GitHub Discussions: For questions, ideas, and community chat
- Discord Server: Real-time chat and support in #general, #help, #showcase, and #development
- Matrix: Bridged with Discord for Matrix users
For commercial deployments and enterprise support:
We welcome contributions! Please see our Contributing Guide for details.
# Clone the repository
git clone https://github.com/cool-japan/voirs.git
cd voirs/crates/voirs-recognizer
# Install dependencies
cargo build --all-features
# Run tests
cargo test --all-features
# Run benchmarks
cargo bench --all-features
This project is licensed under either of

- Apache License, Version 2.0
- MIT license

at your option.
If you use VoiRS Recognizer in your research, please cite:
@software{voirs_recognizer,
  title = {VoiRS Recognizer: Advanced Speech Recognition for Neural TTS},
  author = {Tetsuya Kitahata},
  organization = {Cool Japan Co., Ltd.},
  year = {2024},
  url = {https://github.com/cool-japan/voirs}
}