| Crates.io | whisper-apr |
| lib.rs | whisper-apr |
| version | 0.2.0 |
| created_at | 2026-01-02 22:45:03.794032+00 |
| updated_at | 2026-01-22 22:51:44.748612+00 |
| description | WASM-first automatic speech recognition engine implementing OpenAI Whisper |
| homepage | |
| repository | https://github.com/paiml/whisper.apr |
| max_upload_size | |
| id | 2019299 |
| size | 10,983,039 |
Production-Ready OpenAI Whisper Implementation for Browser & Edge
whisper.apr is a pure Rust implementation of OpenAI's Whisper speech recognition model, engineered from the ground up for WebAssembly (WASM) deployment. It features a custom .apr model format optimized for browser streaming, SIMD acceleration, and int4/int8 quantization for efficient edge inference.
| Feature | whisper.apr | whisper.cpp | whisper-web |
|---|---|---|---|
| Pure Rust | Yes | C++ | JavaScript |
| WASM-First | Yes | Ported | Native |
| Int4 Quantization | Yes | Int8 only | No |
| Streaming Inference | Yes | Batch only | Limited |
| Zero-Copy Loading | Yes | No | No |
| Custom Format (.apr) | Yes | GGML | ONNX |
| Browser-Native | Yes | Emscripten | Yes |
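"Zero-copy loading" in the table above means interpreting the downloaded model bytes in place rather than copying them into freshly allocated buffers. A minimal std-only sketch of the idea (illustrative only; the names here are not the whisper-apr API, and the crate's actual loader also handles alignment and endianness across fetch/mmap buffers):

```rust
/// Reinterpret an f32-aligned byte buffer as f32 values without copying.
/// Returns None if the buffer is misaligned or not a multiple of 4 bytes.
fn view_as_f32(bytes: &[u8]) -> Option<&[f32]> {
    // align_to splits into (unaligned prefix, aligned middle, suffix);
    // a zero-copy view is only valid when the whole buffer lands in the middle.
    let (prefix, floats, suffix) = unsafe { bytes.align_to::<f32>() };
    if prefix.is_empty() && suffix.is_empty() {
        Some(floats)
    } else {
        None
    }
}

fn main() {
    let storage: Vec<f32> = vec![1.0, 2.0];
    // View the same memory as raw bytes, as a fetched model buffer would be.
    let bytes: &[u8] =
        unsafe { std::slice::from_raw_parts(storage.as_ptr().cast::<u8>(), storage.len() * 4) };
    let floats = view_as_f32(bytes).expect("buffer is f32-aligned");
    assert_eq!(floats, &[1.0, 2.0]);
    println!("zero-copy view: {:?}", floats);
}
```

The payoff is that tensor data never needs a second allocation, which is what makes the load-time numbers in the format benchmarks below possible.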
| Model | Parameters | .apr Size (Int4) | .apr Size (Int8) | RTF* |
|---|---|---|---|---|
| tiny | 39M | 20 MB | 39 MB | 0.3x |
| base | 74M | 37 MB | 74 MB | 0.5x |
| small | 244M | 122 MB | 244 MB | 0.8x |
| medium | 769M | 385 MB | 769 MB | 1.2x |
| large | 1.5B | 750 MB | 1.5 GB | 2.0x |
*RTF = Real-Time Factor (processing time ÷ audio duration) on M1 MacBook; lower is faster, e.g. 0.3x transcribes 10 s of audio in roughly 3 s
<script type="module">
  import init, { WhisperModel } from './whisper_apr.js';

  async function transcribe() {
    await init();
    const model = await WhisperModel.load('/models/whisper-tiny.apr');
    // fetchAudioAsFloat32Array is a user-supplied helper that decodes the
    // file into 16 kHz mono Float32Array samples (e.g. via the Web Audio API).
    const audioData = await fetchAudioAsFloat32Array('/audio/sample.wav');
    const result = await model.transcribe(audioData);
    console.log(result.text);
  }

  transcribe();
</script>
use whisper_apr::{WhisperModel, TranscribeOptions};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let model = WhisperModel::load("whisper-tiny.apr")?;
    let audio = whisper_apr::load_audio("sample.wav")?;
    let result = model.transcribe(&audio, TranscribeOptions::default())?;
    println!("{}", result.text);
    Ok(())
}
use whisper_apr::{StreamingProcessor, StreamingConfig};

let config = StreamingConfig {
    chunk_duration_ms: 5000,
    overlap_ms: 500,
    language: Some("en".to_string()),
};
let mut processor = StreamingProcessor::new(model, config);

// Feed audio chunks as they arrive
while let Some(chunk) = audio_source.next_chunk() {
    if let Some(partial) = processor.process_chunk(&chunk)? {
        println!("Partial: {}", partial.text);
    }
}

let final_result = processor.finalize()?;
println!("Final: {}", final_result.text);
Building for the browser requires the wasm32-unknown-unknown target (rustup target add wasm32-unknown-unknown).
# Clone the repository
git clone https://github.com/paiml/whisper.apr.git
cd whisper.apr
# Build native (for testing)
cargo build --release
# Build WASM
make wasm
# Run tests
cargo test
Convert existing Whisper models to .apr format:
# From safetensors (Hugging Face)
cargo run --bin convert -- \
  --input openai/whisper-tiny \
  --output whisper-tiny.apr \
  --quantize int8

# With int4 quantization for smaller size
cargo run --bin convert -- \
  --input openai/whisper-small \
  --output whisper-small-int4.apr \
  --quantize int4
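Conceptually, int8 quantization stores each weight as a signed byte plus a scale factor that is multiplied back at inference time. A self-contained sketch of symmetric per-tensor quantization (an illustration only; the converter's actual scheme, such as per-block scales for int4, may differ):

```rust
/// Symmetric per-tensor int8 quantization: scale = max|w| / 127.
fn quantize_int8(weights: &[f32]) -> (Vec<i8>, f32) {
    let max_abs = weights.iter().fold(0.0f32, |m, w| m.max(w.abs()));
    let scale = if max_abs == 0.0 { 1.0 } else { max_abs / 127.0 };
    let quantized = weights
        .iter()
        .map(|w| (w / scale).round().clamp(-127.0, 127.0) as i8)
        .collect();
    (quantized, scale)
}

/// Recover approximate f32 weights from the stored bytes and scale.
fn dequantize_int8(quantized: &[i8], scale: f32) -> Vec<f32> {
    quantized.iter().map(|&q| q as f32 * scale).collect()
}

fn main() {
    let weights = [0.5f32, -1.0, 0.25, 1.0];
    let (quantized, scale) = quantize_int8(&weights);
    let restored = dequantize_int8(&quantized, scale);
    // max|w| = 1.0, so scale = 1/127 and ±1.0 map exactly to ±127.
    assert_eq!(quantized[3], 127);
    for (a, b) in weights.iter().zip(&restored) {
        assert!((a - b).abs() < 1e-2);
    }
    println!("scale = {scale}, quantized = {quantized:?}");
}
```

This is why int8 shrinks a model to roughly one quarter of its f32 size: one byte per weight plus a handful of scale values.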
┌─────────────────────────────────────────────────────────────────┐
│ whisper.apr │
├─────────────────────────────────────────────────────────────────┤
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Audio │ │ Encoder │ │ Decoder │ │
│ │ Processing │──│ (6 layers) │──│ (6 layers) │──► Text │
│ │ │ │ │ │ │ │
│ │ • Resampling │ │ • Self-Attn │ │ • Self-Attn │ │
│ │ • Mel Spec │ │ • FFN │ │ • Cross-Attn │ │
│ │ • STFT │ │ • LayerNorm │ │ • FFN │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
├─────────────────────────────────────────────────────────────────┤
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Tokenizer │ │ Quantization │ │ SIMD │ │
│ │ │ │ │ │ Primitives │ │
│ │ • BPE │ │ • Int4/Int8 │ │ │ │
│ │ • 51,865 tok │ │ • Mixed Prec │ │ • MatMul │ │
│ │ • Multi-lang │ │ • Zero-Copy │ │ • Softmax │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
└─────────────────────────────────────────────────────────────────┘
| Module | Description | LOC |
|---|---|---|
| audio/ | Mel spectrogram, resampling, streaming | ~2,500 |
| model/ | Encoder, decoder, attention, quantization | ~8,000 |
| tokenizer/ | BPE tokenizer, vocabulary, special tokens | ~1,500 |
| inference/ | Greedy/beam search decoding, KV cache | ~3,000 |
| format/ | .apr format, compression, streaming load | ~2,000 |
| wasm/ | JavaScript bindings, Web Worker support | ~1,500 |
| Total | | ~24,500 |
The .apr (Aprender) format is optimized for streaming and browser deployment:
┌────────────────────────────────────────┐
│ APR File Structure │
├────────────────────────────────────────┤
│ Magic: "APR\0" (4 bytes) │
│ Version: u32 (4 bytes) │
│ Header Size: u32 (4 bytes) │
├────────────────────────────────────────┤
│ Model Config (JSON, compressed) │
│ • n_vocab, n_audio_ctx, n_audio_state │
│ • n_audio_head, n_audio_layer │
│ • n_text_ctx, n_text_state, ... │
├────────────────────────────────────────┤
│ Vocabulary (BPE tokens, compressed) │
├────────────────────────────────────────┤
│ Tensor Blocks (streaming-ready) │
│ • Block header (name, shape, dtype) │
│ • Compressed tensor data (zstd) │
│ • Quantization scales (if int4/int8) │
└────────────────────────────────────────┘
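Following the diagram, the fixed 12-byte preamble can be parsed with plain std code. A hedged sketch (little-endian field order is an assumption for illustration; the crate's format/ module is authoritative):

```rust
#[derive(Debug, PartialEq)]
struct AprPreamble {
    version: u32,
    header_size: u32,
}

/// Parse the fixed preamble: magic "APR\0", then version and header size
/// as little-endian u32 values (assumed byte order).
fn parse_apr_preamble(bytes: &[u8]) -> Result<AprPreamble, &'static str> {
    if bytes.len() < 12 {
        return Err("file too short");
    }
    if &bytes[0..4] != b"APR\0" {
        return Err("bad magic");
    }
    Ok(AprPreamble {
        version: u32::from_le_bytes(bytes[4..8].try_into().unwrap()),
        header_size: u32::from_le_bytes(bytes[8..12].try_into().unwrap()),
    })
}

fn main() {
    // Build a synthetic preamble: magic, version 1, 256-byte header.
    let mut file = Vec::new();
    file.extend_from_slice(b"APR\0");
    file.extend_from_slice(&1u32.to_le_bytes());
    file.extend_from_slice(&256u32.to_le_bytes());

    let preamble = parse_apr_preamble(&file).unwrap();
    assert_eq!(preamble, AprPreamble { version: 1, header_size: 256 });
    println!("{preamble:?}");
}
```

Because the header size is known after 12 bytes, a streaming loader can begin fetching tensor blocks before the whole file has arrived.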
Benchmark results for Whisper Tiny show the delivery-size advantage of .apr for WASM deployment:
| Format | Size | Size vs. baseline | WASM Ready |
|---|---|---|---|
| SafeTensors | 145 MB | baseline | ❌ Too large |
| GGML | 75 MB | 52% | ⚠️ Moderate |
| APR-f32 | 145 MB | 100% | ❌ Too large |
| APR-int8 | 37 MB | 25% | ✅ Excellent |
| Metric | APR-f32 | APR-int8 | Improvement |
|---|---|---|---|
| File Read | 87ms | 21ms | 4x faster |
| Parse | 73ms | 19ms | 4x faster |
| Model Load | 490ms | 416ms | 15% faster |
| First Token | ~280ms | ~280ms | Unchanged |
Run the benchmark yourself:
cargo run --example format_comparison --release
| Platform | Time | Memory | RTF |
|---|---|---|---|
| Native (M1 Mac) | 9.2s | 180 MB | 0.31x |
| Native (x86 AVX2) | 12.1s | 180 MB | 0.40x |
| WASM (Chrome) | 18.5s | 220 MB | 0.62x |
| WASM (Firefox) | 21.3s | 225 MB | 0.71x |
| WASM (Safari) | 24.1s | 230 MB | 0.80x |
/// Main model interface
pub struct WhisperModel { /* ... */ }

impl WhisperModel {
    /// Load model from .apr file
    pub fn load(path: impl AsRef<Path>) -> WhisperResult<Self>;

    /// Load with custom options
    pub fn load_with_options(path: impl AsRef<Path>, opts: LoadOptions) -> WhisperResult<Self>;

    /// Transcribe audio samples (f32, 16 kHz mono)
    pub fn transcribe(&self, audio: &[f32], opts: TranscribeOptions) -> WhisperResult<TranscribeResult>;

    /// Translate to English
    pub fn translate(&self, audio: &[f32], opts: TranscribeOptions) -> WhisperResult<TranscribeResult>;

    /// Detect language
    pub fn detect_language(&self, audio: &[f32]) -> WhisperResult<DetectedLanguage>;
}
/// Transcription options
pub struct TranscribeOptions {
    pub language: Option<String>,  // Force language (None = auto-detect)
    pub task: Task,                // Transcribe or Translate
    pub beam_size: usize,          // Beam search width (1 = greedy)
    pub best_of: usize,            // Sample multiple candidates and pick the best
    pub temperature: f32,          // Sampling temperature
    pub compression_ratio_threshold: f32,
    pub logprob_threshold: f32,
    pub no_speech_threshold: f32,
}

/// Transcription result
pub struct TranscribeResult {
    pub text: String,
    pub segments: Vec<Segment>,
    pub language: String,
    pub language_probability: f32,
}
// TypeScript definitions
export class WhisperModel {
  static load(url: string): Promise<WhisperModel>;
  transcribe(audio: Float32Array, options?: TranscribeOptions): Promise<TranscribeResult>;
  translate(audio: Float32Array, options?: TranscribeOptions): Promise<TranscribeResult>;
  detectLanguage(audio: Float32Array): Promise<DetectedLanguage>;
  free(): void;
}

export interface TranscribeOptions {
  language?: string;
  task?: 'transcribe' | 'translate';
  beamSize?: number;
  temperature?: number;
}

export interface TranscribeResult {
  text: string;
  segments: Segment[];
  language: string;
  languageProbability: number;
}
Zero-JavaScript demos showcasing whisper.apr capabilities. All demos are pure Rust/WASM, served by Probar, which sets the COOP/COEP headers required for SharedArrayBuffer:
cd demos && probar serve
# Open http://localhost:8080
| Demo | Description |
|---|---|
| Real-Time Transcription | Live microphone transcription with streaming results |
| File Upload Transcription | Upload audio/video files with timeline visualization |
| Real-Time Translation | Live speech-to-English translation (99 languages) |
| File Upload Translation | Batch translation of uploaded media files |
cd demos && probar test -v # Run all demo tests
probar coverage # Pixel regression tests
whisper.apr/
├── src/
│ ├── lib.rs # Library entry point
│ ├── audio/ # Audio processing
│ │ ├── mel.rs # Mel spectrogram
│ │ ├── resampler.rs # Audio resampling
│ │ ├── batch.rs # Batch preprocessing
│ │ └── streaming.rs # Streaming processor
│ ├── model/ # Neural network
│ │ ├── encoder.rs # Transformer encoder
│ │ ├── decoder.rs # Transformer decoder
│ │ ├── attention.rs # Multi-head attention
│ │ └── quantized.rs # Quantization support
│ ├── tokenizer/ # BPE tokenizer
│ ├── inference/ # Decoding strategies
│ ├── format/ # .apr format
│ └── wasm/ # WASM bindings
├── demos/ # Demo applications
├── benches/ # Criterion benchmarks
├── tests/ # Integration tests
└── docs/ # Documentation
make build # Build release
make wasm # Build WASM package
make test # Run all tests
make bench # Run benchmarks
make lint # Clippy + fmt check
make coverage # Generate coverage report
make docs # Build documentation
# Unit tests
cargo test --lib
# Integration tests
cargo test --test integration
# Property tests
cargo test --test property_tests
# WASM tests (requires wasm-pack)
wasm-pack test --headless --chrome
whisper.apr follows EXTREME TDD methodology with comprehensive quality gates:
| Metric | Target | Achieved |
|---|---|---|
| Test Count | 1500+ | 1868 |
| Line Coverage | 95% | 95% |
| Property Tests | 15+ | 19 |
| WASM Binary | <700 KB | 668 KB |
| Clippy Warnings | 0 | 0 |
| Security Audit | Pass | Pass |
# .pmat/comply.toml
[metrics]
total_tickets = 64
completed_tickets = 64
completion_rate = 100.0
test_count = 1868
line_coverage = 95.0
property_tests = 19
source_loc = 24500
wasm_binary_kb = 668
Licensed under the MIT License. See LICENSE for details.
Built with Rust and a passion for edge AI