loqa-voice-dsp

Shared DSP library for voice analysis, providing core digital signal processing functionality for both the Loqa backend and the VoiceFind mobile app.

Features

  • Pitch Detection: YIN and pYIN algorithms for fundamental frequency (F0) estimation
    • NEW in v0.3.0: Stateful VoiceAnalyzer API for streaming analysis
    • NEW in v0.3.0: pYIN (probabilistic YIN) for better noisy/breathy voice detection
    • NEW in v0.3.0: Configurable algorithm selection (Auto/pYIN/YIN/Autocorr)
    • NEW in v0.4.0: Custom pYIN implementation optimized for voice (no external dependencies)
  • Formant Extraction: Linear Predictive Coding (LPC) for formant analysis
  • FFT Utilities: Fast Fourier Transform for spectral analysis
  • Spectral Analysis: Spectral centroid, tilt, and rolloff calculations
  • HNR (Harmonics-to-Noise Ratio): Breathiness measurement using Boersma's autocorrelation method
  • H1-H2 Amplitude Difference: Vocal weight analysis (lighter vs fuller voice quality)

Installation

iOS (CocoaPods)

Add to your Podfile:

pod 'LoqaVoiceDSP', '~> 0.3.0'

Then run:

pod install

iOS (Swift Package Manager)

In Xcode:

  1. File → Add Packages
  2. Enter repository URL: https://github.com/loqalabs/loqa
  3. Select version: 0.3.0 or later

Or add to Package.swift:

dependencies: [
    .package(url: "https://github.com/loqalabs/loqa", from: "0.3.0")
]

Rust (Cargo)

Add to your Cargo.toml:

[dependencies]
loqa-voice-dsp = "0.3.0"

Usage

Buffer Size Recommendations

Pitch detection algorithms analyze buffers in frames. For best results:

  • Recommended: 2048-4096 samples (~46-93ms @ 44100 Hz)
  • Minimum: 1024 samples for real-time applications
  • Maximum: 4096 samples per frame (accuracy degrades beyond this)

Why this matters:

  • Large buffers (>4096) may contain pitch variations, voiced/unvoiced transitions, or multiple syllables
  • The algorithm requires buffer size ≥ 2× longest period for the lowest expected frequency:
    • For 80 Hz (male voice): minimum ~1103 samples
    • For 400 Hz (female voice): minimum ~221 samples
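
A quick way to sanity-check a buffer before calling into the library: compute the minimum length directly from the lowest expected frequency. The helper below is illustrative, not part of this crate's API:

// Minimum buffer length: at least two periods of the lowest expected frequency.
fn min_buffer_len(sample_rate: u32, min_frequency: f32) -> usize {
    (2.0 * sample_rate as f32 / min_frequency).ceil() as usize
}

assert_eq!(min_buffer_len(44_100, 80.0), 1103);  // 80 Hz floor (male voice)
assert_eq!(min_buffer_len(44_100, 400.0), 221);  // 400 Hz floor (female voice)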

Frame-based analysis for long audio (v0.3.0+):

For buffers larger than 4096 samples, use the new VoiceAnalyzer API:

use loqa_voice_dsp::{VoiceAnalyzer, AnalysisConfig};

let config = AnalysisConfig::default()
    .with_frame_size(2048)
    .with_hop_size(1024);  // 50% overlap

let mut analyzer = VoiceAnalyzer::new(config)?;
let results = analyzer.process_stream(&long_audio_buffer);

Legacy approach (v0.2.x):

use loqa_voice_dsp::{detect_pitch, PitchResult};

fn analyze_long_buffer(buffer: &[f32], sample_rate: u32) -> Vec<PitchResult> {
    const FRAME_SIZE: usize = 2048;
    const HOP_SIZE: usize = 1024;  // 50% overlap

    let mut results = Vec::new();
    if buffer.len() < FRAME_SIZE {
        return results;  // too short for even one frame
    }
    // Inclusive upper bound so the final full frame is analyzed too.
    for i in (0..=buffer.len() - FRAME_SIZE).step_by(HOP_SIZE) {
        let frame = &buffer[i..i + FRAME_SIZE];
        if let Ok(pitch) = detect_pitch(frame, sample_rate, 80.0, 400.0) {
            results.push(pitch);
        }
    }
    results
}

Rust (Loqa Backend)

New in v0.3.0 - Stateful API:

use loqa_voice_dsp::{VoiceAnalyzer, AnalysisConfig, PitchAlgorithm};

let audio_samples: Vec<f32> = /* your audio data */;

// Create analyzer with pYIN algorithm
let config = AnalysisConfig::default()
    .with_sample_rate(16000)
    .with_frame_size(2048)
    .with_algorithm(PitchAlgorithm::PYIN);

let mut analyzer = VoiceAnalyzer::new(config)?;

// Process single frame
let pitch = analyzer.process_frame(&audio_samples)?;
println!("Frequency: {} Hz", pitch.frequency);
println!("Confidence: {}", pitch.confidence);
println!("Voiced Probability: {}", pitch.voiced_probability);

// Or process a stream
let results = analyzer.process_stream(&long_audio_buffer);
for (i, pitch) in results.iter().enumerate() {
    println!("Frame {}: {} Hz (conf: {})", i, pitch.frequency, pitch.confidence);
}

Legacy API (still supported):

use loqa_voice_dsp::{detect_pitch, extract_formants, compute_fft, calculate_hnr, calculate_h1h2};

let audio_samples: Vec<f32> = /* your audio data */;
let sample_rate = 16000;

// Pitch detection (single-shot)
let pitch = detect_pitch(&audio_samples, sample_rate, 80.0, 400.0)?;
println!("Frequency: {} Hz, Confidence: {}", pitch.frequency, pitch.confidence);

// Formant extraction
let formants = extract_formants(&audio_samples, sample_rate, 14)?;
println!("F1: {} Hz, F2: {} Hz", formants.f1, formants.f2);

// HNR (breathiness)
let hnr = calculate_hnr(&audio_samples, sample_rate, 75.0, 500.0)?;
println!("HNR: {} dB, Voiced: {}", hnr.hnr, hnr.is_voiced);

// H1-H2 (vocal weight)
let h1h2 = calculate_h1h2(&audio_samples, sample_rate, Some(pitch.frequency))?;
println!("H1-H2: {} dB", h1h2.h1h2);

// FFT
let fft_result = compute_fft(&audio_samples, sample_rate, 2048)?;

iOS (Swift via FFI)

New in v0.3.0 - Stateful Analyzer:

// Create analyzer configuration
var config = loqa_analysis_config_default()
config.algorithm = 1  // 0=Auto, 1=PYIN, 2=YIN, 3=Autocorr
config.frame_size = 2048
config.sample_rate = 16000

// Create analyzer
let analyzer = loqa_voice_analyzer_new(config)
defer { loqa_voice_analyzer_free(analyzer) }  // Always free

// Process single frame
let pitchResult = samples.withUnsafeBufferPointer { buffer in
    loqa_voice_analyzer_process_frame(
        analyzer,
        buffer.baseAddress!,
        buffer.count
    )
}
if pitchResult.success {
    print("Pitch: \(pitchResult.frequency)Hz")
    print("Confidence: \(pitchResult.confidence)")
    print("Voiced Probability: \(pitchResult.voiced_probability)")
}

// Or process stream
var results = [PitchResultFFI](repeating: PitchResultFFI(), count: 100)
let count = samples.withUnsafeBufferPointer { buffer in
    results.withUnsafeMutableBufferPointer { resultsBuffer in
        loqa_voice_analyzer_process_stream(
            analyzer,
            buffer.baseAddress!,
            buffer.count,
            resultsBuffer.baseAddress!,
            100
        )
    }
}
print("Got \(count) pitch results")

Legacy API (still supported):

// Call C-compatible FFI functions
let samples: [Float] = /* your audio data */

// Pitch detection
let pitchResult = samples.withUnsafeBufferPointer { buffer in
    loqa_detect_pitch(
        buffer.baseAddress!,
        buffer.count,
        16000,  // sample rate
        80.0,   // min freq
        400.0   // max freq
    )
}
if pitchResult.success {
    print("Pitch: \(pitchResult.frequency)Hz, Confidence: \(pitchResult.confidence)")
}

// HNR (breathiness)
let hnrResult = samples.withUnsafeBufferPointer { buffer in
    loqa_calculate_hnr(
        buffer.baseAddress!,
        buffer.count,
        16000,  // sample rate
        75.0,   // min freq
        500.0   // max freq
    )
}
if hnrResult.success {
    print("HNR: \(hnrResult.hnr) dB, Voiced: \(hnrResult.is_voiced)")
}

// H1-H2 (vocal weight) - pass 0.0 for f0 to auto-detect
let h1h2Result = samples.withUnsafeBufferPointer { buffer in
    loqa_calculate_h1h2(
        buffer.baseAddress!,
        buffer.count,
        16000,  // sample rate
        pitchResult.frequency  // use detected pitch, or 0.0 to auto-detect
    )
}
if h1h2Result.success {
    print("H1-H2: \(h1h2Result.h1h2) dB")
}

Android (Java via JNI)

// Build with --features android-jni
import com.voicefind.VoiceFindDSP;

float[] audioSamples = /* your audio data */;
VoiceFindDSP.PitchResult pitch = VoiceFindDSP.detectPitch(
    audioSamples,
    16000,  // sample rate
    80.0f,  // min freq
    400.0f  // max freq
);

System.out.println("Frequency: " + pitch.frequency + " Hz");

Note: Android JNI requires building with --features android-jni

FFI Safety & Parameter Validation

FFI Safety Requirements

Critical: All FFI structs use #[repr(C)] to ensure C-compatible memory layout. Failure to maintain this can cause alignment issues and incorrect values (see historical issues #1, #2, #3).

Memory safety:

  • All FFI functions validate null pointers before dereferencing
  • FFT results (loqa_compute_fft) allocate memory that must be freed using loqa_free_fft_result
  • Never free FFT results more than once
  • Never use FFT result pointers after calling loqa_free_fft_result

Swift/iOS example with proper cleanup:

var fftResult = loqa_compute_fft(buffer, count, sampleRate, fftSize)
defer { loqa_free_fft_result(&fftResult) }  // Always free

if fftResult.success {
    let spectral = loqa_analyze_spectrum(&fftResult)
    // Use spectral features...
}

Parameter Validation Ranges

Important: All validation happens in the Rust core. Higher-level layers (Swift/TypeScript) should trust Rust validation rather than implementing their own rules.

Pitch Detection (loqa_detect_pitch)

| Parameter | Valid Range | Recommended | Notes |
|---|---|---|---|
| buffer_size | ≥ 100 samples | 2048-4096 samples | See "Buffer Size Recommendations" above |
| sample_rate | 8000-96000 Hz | 16000-44100 Hz | Higher rates support higher frequency ranges |
| min_frequency | 20-4000 Hz | 80 Hz (male voice) | Must be < max_frequency |
| max_frequency | 40-8000 Hz | 400 Hz (voice) | Must be > min_frequency |

Formant Extraction (loqa_extract_formants)

| Parameter | Valid Range | Recommended | Notes |
|---|---|---|---|
| buffer_size | ≥ 2048 samples | 2048-4096 samples | Larger buffers improve formant resolution |
| sample_rate | 8000-96000 Hz | 16000-44100 Hz | Higher rates capture higher formants |
| lpc_order | 8-24 | 12-16 | NOT sample_rate / 1000 - use the fixed range instead |

Historical Note: Issue loqa-expo-dsp#8 - the TypeScript layer calculated lpc_order = sample_rate / 1000 + 2, which gave 46 at 44.1 kHz. The Swift layer rejected this as out of range, causing all calls to fail. Solution: use the fixed range 8-24 for all sample rates.
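
If a caller still wants a rate-dependent heuristic, clamping it into the validated range avoids the failure above. A minimal sketch (the helper name is hypothetical):

// Clamp the legacy heuristic into the validated 8-24 range.
// At 44.1 kHz the raw heuristic gives 44 + 2 = 46, which would be rejected.
fn safe_lpc_order(sample_rate: u32) -> u32 {
    (sample_rate / 1000 + 2).clamp(8, 24)
}

assert_eq!(safe_lpc_order(44_100), 24);  // clamped down from 46
assert_eq!(safe_lpc_order(16_000), 18);  // already within range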

FFT (loqa_compute_fft)

| Parameter | Valid Range | Recommended | Notes |
|---|---|---|---|
| buffer_size | ≥ fft_size | = fft_size | Larger buffers are truncated |
| sample_rate | 8000-96000 Hz | 16000-48000 Hz | Affects frequency bin resolution |
| fft_size | Power of 2: 512-8192 | 2048 or 4096 | Non-power-of-2 may fail (implementation-specific) |
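
As the table notes, fft_size sets the frequency bin resolution: each bin spans sample_rate / fft_size Hz. For example:

// Frequency resolution of each FFT bin: sample_rate / fft_size.
let bin_hz = 16_000.0_f32 / 2048.0;  // = 7.8125 Hz per bin
// Doubling fft_size halves the bin width but requires a longer buffer.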

HNR Calculation (loqa_calculate_hnr)

| Parameter | Valid Range | Recommended | Notes |
|---|---|---|---|
| buffer_size | ≥ 2048 samples | 2048-4096 | Needs multiple pitch periods |
| sample_rate | 8000-96000 Hz | 16000 Hz | Standard voice analysis rate |
| min_frequency | 50-300 Hz | 75 Hz | Lowest expected F0 |
| max_frequency | 200-600 Hz | 500 Hz | Highest expected F0 |

H1-H2 Calculation (loqa_calculate_h1h2)

| Parameter | Valid Range | Recommended | Notes |
|---|---|---|---|
| buffer_size | ≥ 2048 samples | 4096 samples | Needs good spectral resolution |
| sample_rate | 8000-96000 Hz | 16000-44100 Hz | Higher rates improve harmonic resolution |
| f0 | 0.0 or 50-800 Hz | Detected pitch | Pass 0.0 for auto-detect, or provide known F0 |

Auto-detect F0: Pass 0.0 (or any negative value) for f0 parameter to automatically detect pitch before calculating H1-H2.
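
Because all validation happens in the Rust core, out-of-range parameters come back as ordinary Err values from the safe Rust API (and as success == false across the FFI). A minimal sketch of what a caller can expect, assuming the ranges in the tables above:

use loqa_voice_dsp::detect_pitch;

let samples = vec![0.0f32; 2048];

// min_frequency must be below max_frequency, so this should be rejected.
assert!(detect_pitch(&samples, 16_000, 400.0, 80.0).is_err());

// A buffer below the 100-sample minimum should likewise be rejected.
let tiny = vec![0.0f32; 50];
assert!(detect_pitch(&tiny, 16_000, 80.0, 400.0).is_err());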

Common FFI Pitfalls (Lessons Learned)

1. Struct Alignment Issues (Fixed in v0.2.1)

  • Problem: Missing #[repr(C)] caused field misalignment
  • Symptom: Correct frequency in Rust, wrong value in Swift/Java
  • Solution: All FFI structs now have #[repr(C)] - verified by CI tests
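
For reference, #[repr(C)] pins down a deterministic, C-compatible field layout that Swift and Java can read; without it, rustc is free to reorder fields. A sketch of the shape such a struct takes (the field set mirrors the pitch result used in the examples above; the exact definition here is illustrative):

#[repr(C)]
pub struct PitchResultFFI {
    pub frequency: f32,           // detected F0 in Hz
    pub confidence: f32,          // detection confidence, 0.0-1.0
    pub voiced_probability: f32,  // soft voiced/unvoiced decision
    pub success: bool,            // false if validation or detection failed
}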

2. Parameter Validation Mismatches (Fixed in v0.2.2)

  • Problem: TypeScript/Swift layers calculated different validation rules than Rust
  • Symptom: Valid parameters rejected, invalid parameters accepted
  • Solution: Single source of truth - Rust core validates, higher layers trust it

3. Buffer Size Confusion (Documented in v0.2.2)

  • Problem: Users passing large buffers (16384 samples) got false negatives
  • Symptom: Pitch detection failed despite valid audio
  • Solution: Documentation + optional validation warnings (see Issue #5)

4. Memory Leaks with FFT (Prevented by design)

  • Problem: Forgetting to free FFT results leaks ~16-32KB per call
  • Symptom: Gradual memory growth in long-running apps
  • Solution: Always use Swift defer or RAII patterns to ensure cleanup

Implementation Status

  • Crate structure created
  • Pitch detection (YIN + autocorrelation)
  • Formant extraction (LPC-based)
  • FFT utilities
  • Spectral analysis (centroid, tilt, rolloff)
  • HNR calculation (Boersma's autocorrelation method)
  • H1-H2 amplitude difference (vocal weight)
  • iOS FFI layer (C exports for all functions)
  • Android JNI layer (with jni feature)
  • Unit tests (68 passing)
  • FFI integration tests (9 passing)
  • SVD consistency tests (5 passing)
  • Synthetic consistency tests (4 passing)
  • Documentation tests (8 passing)
  • Benchmarks harness
  • Performance benchmarks (validated)

Performance Benchmarks

Validated Performance (2025-11-07) - All targets exceeded ✅

| Operation | Target | Actual (mean) | Result | Speedup |
|---|---|---|---|---|
| Pitch detection (100ms audio) | <20ms | 0.125ms | ✅ PASS | 160x faster |
| Formant extraction (500ms audio) | <50ms | 0.134ms | ✅ PASS | 373x faster |
| FFT (2048 points) | <10ms | ~0.020ms | ✅ PASS | 500x faster |
| Spectral analysis | <5ms | ~0.003ms | ✅ PASS | 1667x faster |
| HNR calculation (100ms window) | <30ms | <1ms | ✅ PASS | >30x faster |
| H1-H2 with F0 provided | <20ms | <1ms | ✅ PASS | >20x faster |

Note: Benchmarks run on Apple M-series silicon. All latency targets easily met with significant performance headroom for real-time voice processing.

Algorithm Details

Custom pYIN Implementation (v0.4.0+)

Starting in v0.4.0, we use a custom pYIN implementation optimized for voice analysis, removing the external pyin crate dependency.

What is pYIN?

pYIN (Mauch & Dixon, 2014) extends the YIN pitch detection algorithm to produce probabilistic pitch estimates, making it more robust for noisy or breathy voice signals.

Key Differences from Standard YIN:

  • YIN: Returns single pitch estimate per frame
  • pYIN: Returns multiple pitch candidates with probabilities, then uses Hidden Markov Model (HMM) to find the smoothest pitch track

Our Voice-Optimized Implementation:

  1. Two-Stage Process (a simplified sketch of stage 1 follows this list):

    • Stage 1: Generate multiple pitch candidates using Beta distribution β(2,18) for thresholds
    • Stage 2: Use Viterbi algorithm on HMM to find optimal pitch track
  2. Voice-Specific Optimizations:

    • Narrower frequency range (80-400 Hz vs. general audio 50-2000 Hz)
    • Tighter HMM transition constraints (voice pitch changes slowly)
    • Voice-tuned Beta distribution concentrating probability near threshold=0.1
  3. Benefits:

    • No external dependencies - fully integrated implementation
    • Better handling of breathy voice - multiple candidates provide robustness
    • Smoother pitch tracks - HMM enforces temporal consistency
    • Voiced probability per frame - soft voiced/unvoiced decisions (not just binary)
    • Smaller binary size - only includes what we need for voice
  4. Performance:

    • ~65-67 µs per 100ms frame (16kHz sample rate)
    • ~1.5-2x overhead vs. standard YIN (acceptable tradeoff for improved accuracy)
    • Still meets 160-500x real-time performance target
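
To make stage 1 concrete, here is a simplified sketch, not the crate's internal code: sweep a grid of YIN thresholds, take the first dip of the cumulative mean normalized difference function (CMNDF) below each threshold, and weight that candidate by the Beta(2,18) density at the threshold. Stage 2 then runs Viterbi over the per-frame candidates.

// Beta(2,18) density: x^(a-1) * (1-x)^(b-1) / B(a,b), where 1/B(2,18) = 342.
// Its mean is 2/20 = 0.1, concentrating probability near threshold = 0.1.
fn beta_2_18_pdf(x: f32) -> f32 {
    342.0 * x * (1.0 - x).powi(17)
}

/// Stage 1 of pYIN (simplified): one pitch candidate per threshold,
/// weighted by how probable that threshold is under Beta(2,18).
/// `cmndf[lag]` is the YIN cumulative mean normalized difference at `lag`.
fn pitch_candidates(cmndf: &[f32], sample_rate: f32) -> Vec<(f32, f32)> {
    let mut candidates = Vec::new();
    for step in 1..=20 {
        let threshold = step as f32 * 0.05; // thresholds sweep (0.05..=1.0)
        // First lag whose CMNDF dips below this threshold.
        if let Some(lag) = (2..cmndf.len()).find(|&lag| cmndf[lag] < threshold) {
            let frequency = sample_rate / lag as f32;
            candidates.push((frequency, beta_2_18_pdf(threshold)));
        }
    }
    candidates // stage 2 runs Viterbi (HMM smoothing) over these candidates
}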

References:

  • Mauch, M. & Dixon, S. (2014). "pYIN: A fundamental frequency estimator using probabilistic threshold distributions." Proceedings of IEEE ICASSP 2014, pp. 659-663.
  • de Cheveigné, A. & Kawahara, H. (2002). "YIN, a fundamental frequency estimator for speech and music." Journal of the Acoustical Society of America, 111(4), 1917-1930.
  • Boersma, P. (1993). "Accurate short-term analysis of the fundamental frequency and the harmonics-to-noise ratio of a sampled sound." Proceedings of the Institute of Phonetic Sciences, 17, 97-110.

Acoustic Measures Reference

HNR (Harmonics-to-Noise Ratio)

Measures the ratio of harmonic (periodic) to noise (aperiodic) energy in voice - the primary acoustic indicator of breathiness.

| HNR Range | Interpretation |
|---|---|
| 18-25+ dB | Clear, less breathy voice |
| 12-18 dB | Moderate breathiness |
| <10 dB | Very breathy or pathological voice |

H1-H2 (First/Second Harmonic Difference)

Measures the amplitude difference between the fundamental and second harmonic - indicates vocal weight.

| H1-H2 Range | Interpretation |
|---|---|
| >5 dB | Lighter, breathier vocal quality |
| 0-5 dB | Balanced vocal weight |
| <0 dB | Fuller, heavier vocal quality |
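
These bands translate directly into code. A hedged sketch mapping measured values onto the labels above (band edges follow the tables; the 10-12 dB gap in the HNR table is reported as borderline):

fn describe_breathiness(hnr_db: f32) -> &'static str {
    if hnr_db >= 18.0 {
        "clear, less breathy voice"
    } else if hnr_db >= 12.0 {
        "moderate breathiness"
    } else if hnr_db < 10.0 {
        "very breathy or pathological voice"
    } else {
        "borderline (10-12 dB)"
    }
}

fn describe_vocal_weight(h1h2_db: f32) -> &'static str {
    if h1h2_db > 5.0 {
        "lighter, breathier vocal quality"
    } else if h1h2_db >= 0.0 {
        "balanced vocal weight"
    } else {
        "fuller, heavier vocal quality"
    }
}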

Test Data

Saarbrücken Voice Database

This library uses samples from the Saarbrücken Voice Database for consistency validation testing.

License: CC BY 4.0

Attribution: Pützer, M. & Barry, W.J., Former Institute of Phonetics, Saarland University. Available at Zenodo.

The SVD provides lab-quality voice recordings including:

  • Sustained vowels (/a:/, /i:/, /u:/) at low, normal, and high pitch
  • 851 healthy control speakers
  • 1002 speakers with documented voice pathologies
  • 50 kHz sample rate, controlled recording conditions

Setting Up Test Data

# 1. Download SVD from Zenodo (CC BY 4.0 license)
#    https://zenodo.org/records/16874898

# 2. Install conversion dependencies
pip install scipy numpy

# 3. Convert SVD files to test format
python scripts/download_svd.py /path/to/extracted/svd

Test Sample Requirements

For comprehensive validation, the library needs test samples with these characteristics:

| Function | Sample Requirements | Recommended Datasets |
|---|---|---|
| Pitch Detection | Male (80-180 Hz), Female (160-300 Hz), varied intonation | Saarbrücken Voice Database, PTDB-TUG |
| Formant Extraction | Sustained vowels /a/, /i/, /u/, /e/, /o/ from multiple speakers | Hillenbrand Vowel Database, VTR-TIMIT |
| HNR | Breathy, modal, and clear voice qualities | Saarbrücken Voice Database |
| H1-H2 | Light to full voice qualities, different phonation types | UCLA Voice Quality Database, VoiceSauce reference recordings |
| Spectral | Dark to bright voice qualities | Voice quality databases with perceptual labels |

Development

# Build
cargo build --release

# Test
cargo test

# Benchmark
cargo bench

# Documentation
cargo doc --open

License

MIT
