| Crates.io | native-pyannote-rs |
| lib.rs | native-pyannote-rs |
| version | 0.1.4 |
| created_at | 2026-01-15 23:17:22.051785+00 |
| updated_at | 2026-01-16 11:40:20.288318+00 |
| description | Speaker diarization using pyannote in Rust |
| homepage | |
| repository | https://github.com/RustedBytes/pyannote-rs |
| max_upload_size | |
| id | 2047329 |
| size | 427,403 |
Pyannote audio diarization in native Rust.
This is a fork of https://github.com/thewh1teagle/pyannote-rs that uses a native Rust crate, kaldi-native-fbank, for audio feature extraction instead of bindings to the C++ variant (knf-rs), and Burn for model inference instead of ONNX Runtime.
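To use the crate, add it to Cargo.toml. A minimal sketch, assuming the dependency is renamed so the `pyannote_rs` paths in the example below resolve (check the crate's own docs for the intended import name):

```toml
[dependencies]
# Assumption: import native-pyannote-rs under the name `pyannote_rs`.
pyannote_rs = { package = "native-pyannote-rs", version = "0.1" }
anyhow = "1"
```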
See the examples:
```rust
use pyannote_rs::{read_wav, EmbeddingExtractor, EmbeddingManager, Segmenter};

fn main() -> anyhow::Result<()> {
    // Read the WAV file's samples and sample rate.
    let (samples, sample_rate) = read_wav("audio.wav")?;

    // Load the bundled Burn (.bpk) weights for both models.
    let segmenter = Segmenter::new("src/nn/segmentation/model.bpk")?;
    let extractor = EmbeddingExtractor::new("src/nn/speaker_identification/model.bpk")?;

    // Track at most 4 distinct speakers.
    let mut speakers = EmbeddingManager::new(4);

    for segment in segmenter.iter_segments(&samples, sample_rate)? {
        let segment = segment?;
        // Compute a speaker embedding for this speech segment.
        let embedding = extractor.extract(&segment.samples, sample_rate)?;
        // Assign the segment to an existing speaker or register a new one
        // (0.5 similarity threshold); fall back to 0 when no speaker is assigned.
        let speaker_id = speakers.upsert(&embedding, 0.5).unwrap_or(0);
        println!("{:.2}-{:.2} => speaker {}", segment.start, segment.end, speaker_id);
    }

    Ok(())
}
```
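The per-segment output above is raw, and consecutive segments often belong to the same speaker. A small, hypothetical post-processing helper (not part of the crate) that merges them into speaking turns:

```rust
/// Hypothetical helper (not provided by pyannote-rs): merge consecutive
/// (start, end, speaker_id) segments with the same speaker into one turn.
fn merge_turns(segments: &[(f64, f64, usize)]) -> Vec<(f64, f64, usize)> {
    let mut turns: Vec<(f64, f64, usize)> = Vec::new();
    for &(start, end, speaker) in segments {
        match turns.last_mut() {
            // Same speaker as the previous turn: extend its end time.
            Some(last) if last.2 == speaker => last.1 = end,
            // New speaker (or first segment): start a new turn.
            _ => turns.push((start, end, speaker)),
        }
    }
    turns
}
```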
pyannote-rs uses two models for speaker diarization: one for segmentation (detecting when someone is speaking) and one for speaker identification (embedding extraction).
Inference is powered by Burn with the ndarray backend using bundled .bpk weights.
Speaker comparison (e.g., determining if Alice spoke again) is done using cosine similarity.
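The EmbeddingManager's internals aren't shown here, but the comparison it performs is the standard cosine-similarity formula. A minimal sketch, using the 0.5 value from the example above as the match threshold:

```rust
/// Cosine similarity between two embeddings: dot(a, b) / (|a| * |b|), in [-1, 1].
/// Values closer to 1 mean the segments likely come from the same speaker.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (norm_a * norm_b)
}

// With the threshold from the example: same speaker if similarity >= 0.5.
```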
The library ships with the ndarray (CPU) backend by default. To run the models on GPU/Metal, switch the Burn backend and enable the corresponding feature in Cargo.toml:
For WGPU, change the type aliases in src/nn/mod.rs to:

```rust
use burn::backend::wgpu::{AutoGraphicsApi, Wgpu, WgpuDevice};

pub type BurnBackend = Wgpu<f32, AutoGraphicsApi>;
pub type BurnDevice = WgpuDevice;
```

and enable the WGPU feature:

```toml
burn = { version = "~0.20", features = ["wgpu"] }
```

WGPU picks Metal automatically on M1/M2 Macs.

For NVIDIA GPUs, switch to the CUDA backend instead:

```rust
use burn::backend::cudarc::{Cuda, CudaDevice};

pub type BurnBackend = Cuda<f32>;
pub type BurnDevice = CudaDevice;
```

and enable CUDA JIT in Cargo:

```toml
burn = { version = "~0.20", features = ["cuda-jit"] }
```

Ensure the NVIDIA driver and CUDA toolkit (12.x) are installed and visible to the build.

After switching the backend and rebuilding dependencies, run any example in release mode to warm up the kernels (the first call compiles them):

```sh
cargo run --release --example infinite
```
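If you switch backends often, the aliases can be feature-gated rather than edited by hand. A sketch only, assuming a hypothetical `wgpu` feature on your fork and that the default CPU aliases use Burn's ndarray types (the actual contents of src/nn/mod.rs may differ):

```rust
// Sketch: gate the backend aliases behind a hypothetical "wgpu" Cargo feature,
// keeping the bundled ndarray (CPU) backend as the default build.
#[cfg(feature = "wgpu")]
mod backend {
    use burn::backend::wgpu::{AutoGraphicsApi, Wgpu, WgpuDevice};
    pub type BurnBackend = Wgpu<f32, AutoGraphicsApi>;
    pub type BurnDevice = WgpuDevice;
}

#[cfg(not(feature = "wgpu"))]
mod backend {
    use burn::backend::ndarray::{NdArray, NdArrayDevice};
    pub type BurnBackend = NdArray<f32>;
    pub type BurnDevice = NdArrayDevice;
}

pub use backend::{BurnBackend, BurnDevice};
```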
Big thanks to pyannote-onnx and kaldi-native-fbank.