| Crates.io | mel_spec |
| lib.rs | mel_spec |
| version | 0.3.4 |
| created_at | 2023-07-22 05:49:29.568525+00 |
| updated_at | 2026-01-04 16:41:50.146651+00 |
| description | Mel spectrograms aligned to the results from the whisper.cpp, pytorch and librosa reference implementations and suited to streaming audio. |
| homepage | |
| repository | https://github.com/wavey-ai/mel-spec |
| max_upload_size | |
| id | 923005 |
| size | 119,610 |
A Rust implementation of mel spectrograms aligned with the whisper.cpp, PyTorch and librosa reference implementations, and suited to streaming audio.

See wavey-ai/hush for a live demo.

```rust
use mel_spec::prelude::*;
```
The fbank module provides Kaldi-style filterbank features with parity to kaldi_native_fbank. This is useful for speaker embedding models like WeSpeaker and pyannote.
```rust
use mel_spec::fbank::{Fbank, FbankConfig};

// Default config matches kaldi_native_fbank defaults
let config = FbankConfig::default();
let fbank = Fbank::new(config);

// Compute features from audio samples (mono, f32, 16kHz)
let samples = vec![0.0f32; 16000]; // one second of silence as a placeholder
let features = fbank.compute(&samples);
// Returns Array2<f32> with shape (num_frames, 80)
```
The default configuration follows the Kaldi defaults used by kaldi_native_fbank.
Mel filterbanks are within 1.0e-7 of librosa's and identical to the filters embedded in Whisper GGML models.
```rust
use mel_spec::mel::mel;

let sampling_rate = 16000.0;
let fft_size = 400;
let n_mels = 80;
let filters = mel(sampling_rate, fft_size, n_mels, None, None, false, true);
// Returns Array2<f64> with shape (80, 201)
```
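To show what the returned matrix is for, applying it to one 201-bin power spectrum is a matrix-vector product. This is only a sketch (the `MelSpectrogram` type below wraps this step for you); the all-zeros spectrum is a placeholder for a real frame of squared FFT magnitudes:

```rust
use mel_spec::mel::mel;
use ndarray::Array1;

// Multiply the (80, 201) filterbank by a single 201-bin power spectrum
// to get 80 mel-band energies.
let filters = mel(16000.0, 400, 80, None, None, false, true);
let power: Array1<f64> = Array1::zeros(201); // placeholder frame of |FFT|^2
let mel_energies = filters.dot(&power); // shape (80,)
```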
The STFT uses overlap-and-save and has parity with PyTorch and whisper.cpp, making it suitable for streaming audio.
```rust
use mel_spec::stft::Spectrogram;

let fft_size = 400;
let hop_size = 160;
let mut spectrogram = Spectrogram::new(fft_size, hop_size);

// Add PCM audio samples
let samples: Vec<f32> = vec![0.0; 1600];
if let Some(fft_frame) = spectrogram.add(&samples) {
    // Use FFT result
}
```
Apply a pre-computed filterbank to FFT results. Output is identical to whisper.cpp and whisper.py.
```rust
use mel_spec::mel::MelSpectrogram;
use ndarray::Array1;
use num::Complex;

let fft_size = 400;
let sampling_rate = 16000.0;
let n_mels = 80;
let mut mel_spec = MelSpectrogram::new(fft_size, sampling_rate, n_mels);

let fft_input = Array1::from(vec![Complex::new(1.0, 0.0); fft_size]);
let mel_frame = mel_spec.add(fft_input);
```
For creating spectrograms from streaming audio, see RingBuffer in rb.rs.
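As a sketch of how the pieces compose for streaming, the loop below feeds hop-sized PCM chunks into the STFT and runs each completed FFT frame through the mel filterbank. It assumes, as the snippets above suggest, that the prelude re-exports both types and that the frame returned by `Spectrogram::add` is accepted by `MelSpectrogram::add`:

```rust
use mel_spec::prelude::*;

let fft_size = 400;
let hop_size = 160;
let mut stft = Spectrogram::new(fft_size, hop_size);
let mut mel = MelSpectrogram::new(fft_size, 16000.0, 80);

// Silence stands in for a real audio source here.
let pcm_chunks = vec![vec![0.0f32; hop_size]; 100]; // ~1s of audio
for chunk in pcm_chunks {
    if let Some(fft_frame) = stft.add(&chunk) {
        let mel_frame = mel.add(fft_frame);
        // hand `mel_frame` to a consumer, e.g. the VAD below
        let _ = mel_frame;
    }
}
```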
Mel spectrograms can be saved as 8-bit TGA images (uncompressed, supported by macOS and Windows). These images encode quantized mel spectrogram data that whisper.cpp can process directly without audio input.
```rust
use mel_spec::quant::{save_tga_8bit, load_tga_8bit};

// Save spectrogram (`mel_data` is a mel spectrogram, e.g. frames
// produced by MelSpectrogram above)
save_tga_8bit(&mel_data, "spectrogram.tga").unwrap();

// Load and use with whisper.cpp
let mel = load_tga_8bit("spectrogram.tga").unwrap();
```
TGA files are lossless for speech-to-text: they encode all of the information available in the model's view of the raw audio.
```sh
ffmpeg -i audio.mp3 -f f32le -ar 16000 -ac 1 pipe:1 | ./target/release/tga_whisper -t spectrogram.tga
```

Output: "the quest for peace."
The VAD uses Sobel edge detection to find speech boundaries in mel spectrograms. This enables real-time processing by finding natural cut points between words and phrases.
```rust
use mel_spec::vad::{VoiceActivityDetector, DetectionSettings};

let settings = DetectionSettings {
    min_energy: 1.0,
    min_y: 3,
    min_x: 5,
    min_mel: 0,
    min_frames: 100,
};
let vad = VoiceActivityDetector::new(settings);
```
Speech in mel spectrograms is characterized by clear gradients. The VAD finds vertical gaps that make suitable cut points, and drops frames that look like gaps in speech, which would otherwise cause Whisper hallucinations.
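To make the edge-detection idea concrete, here is an illustration of the technique itself, not the crate's internal VAD code: a 3x3 Sobel pass over a small (mel bins x frames) window. Strong responses mark the sharp time-axis gradients typical of speech; near-zero regions are the vertical gaps the VAD can cut at.

```rust
use ndarray::{array, Array2};

// Convolve a vertical-edge Sobel kernel over the window and sum the
// absolute responses; silence yields a total near zero.
fn sobel_energy(window: &Array2<f64>) -> f64 {
    let kernel = array![[-1.0, 0.0, 1.0], [-2.0, 0.0, 2.0], [-1.0, 0.0, 1.0]];
    let (h, w) = window.dim();
    let mut total = 0.0;
    for y in 0..h.saturating_sub(2) {
        for x in 0..w.saturating_sub(2) {
            let mut g = 0.0;
            for ky in 0..3 {
                for kx in 0..3 {
                    g += window[[y + ky, x + kx]] * kernel[[ky, kx]];
                }
            }
            total += g.abs();
        }
    }
    total
}
```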

Examples from the JFK speech (spectrogram images omitted):
- Energy but no speech: the VAD correctly rejects it.
- A fleeting word: the VAD correctly detects it.

Full JFK transcript with VAD: jfk_transcript_golden.txt
Benchmarks on Apple M1 Pro (single-threaded, release build):
| Audio Length | Frames | Time | Throughput |
|---|---|---|---|
| 10s | 997 | 21ms | 476x realtime |
| 60s | 5997 | 124ms | 484x realtime |
| 300s (5 min) | 29997 | 622ms | 482x realtime |
Full mel spectrogram pipeline (STFT + mel filterbank + log) at 16kHz, FFT size 512, hop 160, 80 mel bins.
CPU performance is excellent: processing 5 minutes of audio in 622ms means the library runs ~480x faster than realtime on a single core.
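For checking these numbers on other hardware, a minimal timing harness is sketched below; the `run_pipeline` argument is a hypothetical stand-in for the STFT + mel filterbank + log steps, not a mel_spec API:

```rust
use std::time::Instant;

// Realtime factor = seconds of audio processed per second of wall time.
// Plug the streaming loop shown earlier into `run_pipeline`.
fn realtime_factor(samples: &[f32], sample_rate: f32, run_pipeline: impl FnOnce(&[f32])) -> f32 {
    let audio_secs = samples.len() as f32 / sample_rate;
    let start = Instant::now();
    run_pipeline(samples);
    audio_secs / start.elapsed().as_secs_f32()
}
```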
This library focuses on pure Rust CPU implementation. For GPU acceleration, consider:
| Option | Speedup | Notes |
|---|---|---|
| NVIDIA NeMo | ~10x over CPU | Python/PyTorch, uses cuBLAS/cuDNN, best for batch processing |
| torchaudio | ~5-10x | Python/PyTorch, CUDA backend |
| mel-spec gpu branch | ~1.6x | Experimental, requires CUDA toolkit + nvcc |
The experimental gpu branch keeps the full pipeline on the GPU (STFT → mel filterbank → log). It requires C++ CUDA kernels, the CUDA toolkit and NVIDIA hardware.
whisper.cpp produces mel spectrograms to 1.0e-6 precision, but transcription results are invariant to 8-bit quantization: spectrograms can be saved as 8-bit images without losing useful information.
We only need 1.0e-1 precision for accurate results, and rounding may actually improve some difficult transcriptions:

```
Original: [0.158, 0.266, 0.076, 0.196, 0.167, ...]
Rounded:  [0.2, 0.3, 0.1, 0.2, 0.2, ...]
```
Once quantized, the spectrograms are the same (image omitted; top: original, bottom: rounded to 1.0e-1).
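To make the quantization step concrete, here is a hypothetical linear 8-bit quantizer; the quant module's actual scale and encoding may differ, so treat this as a sketch:

```rust
// Mapping the mel range onto 0..=255 gives a step size well below the
// 1.0e-1 precision transcription needs: for a [-1.0, 1.0] range the step
// is 2/255 ≈ 0.008, an order of magnitude finer.
fn quantize(mel: &[f32], min: f32, max: f32) -> Vec<u8> {
    let scale = 255.0 / (max - min);
    mel.iter()
        .map(|&v| ((v.clamp(min, max) - min) * scale).round() as u8)
        .collect()
}

fn dequantize(q: &[u8], min: f32, max: f32) -> Vec<f32> {
    let step = (max - min) / 255.0;
    q.iter().map(|&b| min + b as f32 * step).collect()
}
```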
Speech is captured almost entirely in the frequency domain, and the mel scale reduces it to just 80 bins. Eight bits of grayscale is probably overkill; the data could be compressed further.