Crates.io | voice-stream
lib.rs | voice-stream
source | src
created_at | 2024-11-16 12:32:05.509031
updated_at | 2024-11-18 19:50:11.452897
description | Voice stream is a real-time audio stream processing library with voice activity detection. It provides a high-level interface for capturing audio input, performing voice detection using both WebRTC VAD and Silero VAD, and processing audio streams.
homepage | https://github.com/voicekick/voicekick
repository | https://github.com/voicekick/voicekick
id | 1450346
A Rust library for real-time voice activity detection and audio stream processing. This library provides a high-level interface for capturing audio input, performing voice detection using both WebRTC VAD and Silero VAD, and processing audio streams.
use voice_stream::VoiceStream;
use voice_stream::cpal::traits::StreamTrait;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Create a default voice stream with receiver
    let (voice_stream, receiver) = VoiceStream::default_device().unwrap();

    // Start capturing audio
    voice_stream.play().unwrap();

    // Receive voice data chunks
    for voice_data in receiver {
        // Process voice data (Vec<f32>)
        println!("Received voice data chunk of size: {}", voice_data.len());
    }

    Ok(())
}
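Each chunk the receiver yields is a Vec<f32> of detected voice samples (16 kHz mono after the resampling stage described below). As a minimal sketch of doing more than printing the chunk size, the loop body above could compute a per-chunk RMS level; the RMS calculation here is illustrative and assumes nothing beyond the Vec<f32> chunk type:

// Inside the `for voice_data in receiver` loop above:
// root-mean-square of the chunk as a cheap loudness estimate.
let rms = (voice_data.iter().map(|s| s * s).sum::<f32>()
    / voice_data.len().max(1) as f32)
    .sqrt();
println!("chunk: {} samples, rms {:.4}", voice_data.len(), rms);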
flowchart TD
    Start --> Capture[Capture audio input from the device, mono or multi-channel, at various sample rates]
    Capture --> Convert
    IntoMono --> TakeBuffer[Buffer f32 samples until at least 512 are held]
    TakeBuffer --> Split["Split off samples buffer when >= 512"]
    Split --> Step1

    subgraph Resampler
        %% Nodes
        Convert[Convert i8, i16, i32 or f32 samples to f32]
        Resample[Resample to target sample rate 16,000 Hz]
        IntoMono[Convert multi-channel sound to mono]

        %% Flow connections
        Convert --> Resample --> IntoMono
    end

    subgraph Voice Detection
        %% Nodes
        Step1[Convert to 8 kHz and check is_noise]
        webrtc[is_noise = webrtc_vad_is_noise samples]
        Step2[Get predict from silero_vad_prediction]
        silero[predict = silero_vad_prediction samples]
        is_voice["is_voice = predict > silero_vad_voice_threshold"]
        Decision{"Match is_noise, is_voice"}

        %% Subgraphs for each case
        subgraph CaseTrueTrue ["Case: is_noise and is_voice"]
            ActionTT[Accumulate samples into samples_buffer]
            ReturnTT[Return None]
        end

        subgraph CaseTrueFalse ["Case: is_noise and !is_voice"]
            ActionTF[Clear samples_buffer]
            ReturnTF[Return None]
        end

        subgraph CaseFalse ["Case: !is_noise"]
            ActionF[Push predict to silero_predict_buffer]
            BufferEmpty{Is samples_buffer empty?}
            ReturnNone[Return None]
            ReturnSamples[Return all voice samples]
        end

        %% Flow connections
        Step1 --> webrtc --> Step2 --> silero --> is_voice --> Decision

        %% Decision branches
        Decision -->|is_noise = true and is_voice = true| CaseTrueTrue
        CaseTrueTrue --> ReturnTT
        Decision -->|is_noise = true and is_voice = false| CaseTrueFalse
        CaseTrueFalse --> ReturnTF
        Decision -->|is_noise = false| CaseFalse
        CaseFalse --> BufferEmpty
        BufferEmpty -->|Yes| ReturnNone
        BufferEmpty -->|No| ReturnSamples
    end

    %% Nodes
    ProcessVoiceDetectionSamples{Process voice detection}
    ChannelSendData{Channel send}
    NoiseDiscard[Disregarded into noise void]
    User[User channel receiver]

    ReturnNone -->|None| ProcessVoiceDetectionSamples
    ReturnTT -->|None| ProcessVoiceDetectionSamples
    ReturnTF -->|None| ProcessVoiceDetectionSamples
    ReturnSamples -->|Some| ProcessVoiceDetectionSamples
    ChannelSendData --> User

    %% ChannelSendData branches
    ProcessVoiceDetectionSamples -->|Some voice| ChannelSendData
    ProcessVoiceDetectionSamples -->|No voice| NoiseDiscard
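The three decision branches above map naturally onto a Rust match. Below is a minimal sketch of that logic, assuming hypothetical stand-ins webrtc_vad_is_noise and silero_vad_prediction for the two detectors (the names are taken from the chart labels, not from the library's public API):

// Stand-ins for the two detectors; hypothetical, for illustration only.
fn webrtc_vad_is_noise(_samples: &[f32]) -> bool {
    true
}
fn silero_vad_prediction(_samples: &[f32]) -> f32 {
    0.0
}

/// Mirrors the flowchart's match on (is_noise, is_voice):
/// accumulate while noisy voice continues, clear on noisy non-voice,
/// and flush the accumulated utterance on the first quiet frame.
fn process_frame(
    samples: &[f32],
    samples_buffer: &mut Vec<f32>,
    silero_vad_voice_threshold: f32,
) -> Option<Vec<f32>> {
    let is_noise = webrtc_vad_is_noise(samples);
    let predict = silero_vad_prediction(samples);
    let is_voice = predict > silero_vad_voice_threshold;

    match (is_noise, is_voice) {
        // Voice inside noise: keep accumulating, emit nothing yet.
        (true, true) => {
            samples_buffer.extend_from_slice(samples);
            None
        }
        // Noise without voice: drop whatever was accumulated.
        (true, false) => {
            samples_buffer.clear();
            None
        }
        // Not noise: return the accumulated voice samples, if any.
        (false, _) => {
            if samples_buffer.is_empty() {
                None
            } else {
                Some(std::mem::take(samples_buffer))
            }
        }
    }
}

std::mem::take empties the buffer while handing the caller the accumulated samples, matching the chart's "Return all voice samples" branch.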
The library provides a builder pattern for advanced configuration:
use voice_stream::{VoiceStreamBuilder, WebRtcVoiceActivityProfile};
use voice_stream::cpal;
use voice_stream::cpal::traits::{DeviceTrait, HostTrait, StreamTrait};

let (tx, rx) = std::sync::mpsc::channel();

let host = cpal::default_host();
let select_device = "default";

// Set up the input device and stream with the default input config.
let device = if select_device == "default" {
    host.default_input_device()
} else {
    host.input_devices()
        .expect("Failed to get input devices")
        .find(|x| x.name().map(|y| y == select_device).unwrap_or(false))
}
.expect("Failed to find input device");

let config = device
    .default_input_config()
    .expect("Failed to get default input config");

let voice_stream = VoiceStreamBuilder::new(config, device, tx)
    .with_sound_buffer_until_size(1024)
    .with_voice_detection_silero_voice_threshold(0.5)
    .with_voice_detection_webrtc_profile(WebRtcVoiceActivityProfile::AGGRESSIVE)
    .build()
    .unwrap();
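Starting the built stream and draining the channel then works the same way as in the basic example. A short sketch continuing the snippet above (the chunks on rx are the same Vec<f32> type as before):

// Continuing after `build()`: start capture, then read chunks from `rx`.
voice_stream.play().unwrap();

std::thread::spawn(move || {
    for voice_data in rx {
        // Each received chunk is a detected stretch of voice samples.
        println!("voice chunk: {} samples", voice_data.len());
    }
});

Reading on a separate thread keeps the capture callback free; the stream keeps running for as long as voice_stream stays alive.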