Crates.io | kalosm-sound |
lib.rs | kalosm-sound |
version | 0.3.4 |
source | src |
created_at | 2023-12-16 17:00:51.021682 |
updated_at | 2024-08-28 18:16:41.905465 |
description | A set of pretrained audio models |
repository | https://github.com/floneum/floneum |
id | 1071864 |
size | 503,971 |
Kalosm Sound is a collection of audio models and utilities for the Kalosm framework. It supports several voice activity detection models, and provides utilities for transcribing audio into text.
Models in Kalosm Sound work with any [AsyncSource]. You can use [MicInput::stream] to stream audio from the microphone, or any synchronous audio source that implements [rodio::Source], such as an MP3 or WAV file.
You can transform the audio streams with:
[VoiceActivityDetectorExt::voice_activity_stream]: Detect voice activity in the audio data
[DenoisedExt::denoise_and_detect_voice_activity]: Denoise the audio data and detect voice activity
[AsyncSourceTranscribeExt::transcribe]: Chunk an audio stream based on voice activity and then transcribe the chunked audio data
[VoiceActivityStreamExt::rechunk_voice_activity]: Chunk an audio stream based on voice activity
[VoiceActivityStreamExt::filter_voice_activity]: Filter chunks of audio data based on voice activity
[TranscribeChunkedAudioStreamExt::transcribe]: Transcribe a chunked audio stream
Voice activity detection (VAD) models detect when someone is speaking in an audio stream. The simplest way to use a VAD model is to create an audio stream and call [VoiceActivityDetectorExt::voice_activity_stream], which yields audio chunks, each annotated with the probability that it contains speech:
use kalosm::sound::*;
#[tokio::main]
async fn main() {
    // Get the default microphone input
    let mic = MicInput::default();
    // Stream the audio from the microphone
    let stream = mic.stream().unwrap();
    // Detect voice activity in the audio stream
    let mut vad = stream.voice_activity_stream();
    while let Some(input) = vad.next().await {
        println!("Probability: {}", input.probability);
    }
}
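To make the role of the per-chunk probability concrete, here is a minimal, std-only sketch of thresholding chunks by their speech probability. It does not use the Kalosm API: `ScoredChunk`, `keep_voiced`, and the `0.5` threshold are hypothetical names and values chosen for illustration, and the real stream items may carry different fields.

```rust
/// Hypothetical stand-in for one item of a voice activity stream:
/// a block of samples plus the probability that it contains speech.
#[derive(Debug, Clone, PartialEq)]
struct ScoredChunk {
    probability: f32,
    samples: Vec<f32>,
}

/// Keep only the chunks whose speech probability is at or above `threshold`.
fn keep_voiced(chunks: &[ScoredChunk], threshold: f32) -> Vec<ScoredChunk> {
    chunks
        .iter()
        .filter(|chunk| chunk.probability >= threshold)
        .cloned()
        .collect()
}

fn main() {
    // Made-up probabilities and samples, only to exercise the function.
    let chunks = vec![
        ScoredChunk { probability: 0.05, samples: vec![0.0; 4] },
        ScoredChunk { probability: 0.92, samples: vec![0.3; 4] },
        ScoredChunk { probability: 0.40, samples: vec![0.1; 4] },
        ScoredChunk { probability: 0.75, samples: vec![0.2; 4] },
    ];
    // The 0.5 threshold is an assumption for this sketch.
    let voiced = keep_voiced(&chunks, 0.5);
    println!("{} of {} chunks kept", voiced.len(), chunks.len());
}
```

A fixed threshold is the simplest possible policy; an application with noisy input might prefer to denoise first or to smooth the probabilities over time before deciding.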
Kalosm also provides [VoiceActivityStreamExt::rechunk_voice_activity] to collect chunks of consecutive audio samples with a high VAD probability. This is useful for applications such as speech recognition, where context between consecutive audio samples is important.
use kalosm::sound::*;
use rodio::Source;
#[tokio::main]
async fn main() {
    // Get the default microphone input
    let mic = MicInput::default();
    // Stream the audio from the microphone
    let stream = mic.stream().unwrap();
    // Chunk the audio into chunks of speech
    let vad = stream.voice_activity_stream();
    let mut audio_chunks = vad.rechunk_voice_activity();
    // Print the chunks as they are streamed in
    while let Some(input) = audio_chunks.next().await {
        println!(
            "New voice activity chunk with duration {:?}",
            input.total_duration()
        );
    }
}
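The idea behind rechunking can be sketched without any audio hardware. The std-only example below merges runs of consecutive high-probability chunks into single buffers. It is an illustration of the general technique, not Kalosm's implementation: `ScoredChunk`, `rechunk`, the `0.5` threshold, and the 16 kHz sample rate are all assumptions, and a real rechunker may additionally smooth probabilities or tolerate short pauses.

```rust
/// Hypothetical stand-in for one item of a voice activity stream.
struct ScoredChunk {
    probability: f32,
    samples: Vec<f32>,
}

/// Merge each run of consecutive chunks with `probability >= threshold`
/// into one sample buffer. A chunk below the threshold ends the current run.
fn rechunk(chunks: &[ScoredChunk], threshold: f32) -> Vec<Vec<f32>> {
    let mut merged = Vec::new();
    let mut current: Vec<f32> = Vec::new();
    for chunk in chunks {
        if chunk.probability >= threshold {
            current.extend_from_slice(&chunk.samples);
        } else if !current.is_empty() {
            // Speech just ended: emit the accumulated run and start fresh.
            merged.push(std::mem::take(&mut current));
        }
    }
    // Emit a run that was still open when the input ended.
    if !current.is_empty() {
        merged.push(current);
    }
    merged
}

fn main() {
    // Made-up probabilities; each chunk holds 160 placeholder samples.
    let probabilities: [f32; 5] = [0.1, 0.9, 0.8, 0.2, 0.7];
    let chunks: Vec<ScoredChunk> = probabilities
        .iter()
        .map(|&p| ScoredChunk { probability: p, samples: vec![p; 160] })
        .collect();

    for (index, utterance) in rechunk(&chunks, 0.5).iter().enumerate() {
        // 16 kHz is an assumed sample rate for this illustration.
        let millis = utterance.len() * 1000 / 16_000;
        println!("utterance {index}: {} samples (~{millis} ms)", utterance.len());
    }
}
```

Merging before transcription keeps each utterance in one piece, which is what gives a speech recognizer the surrounding context mentioned above.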
You can use the [Whisper] model to transcribe audio into text. Kalosm can transcribe any [AsyncSource] into a transcription stream with the [AsyncSourceTranscribeExt::transcribe] method:
use kalosm::sound::*;
#[tokio::main]
async fn main() {
    // Get the default microphone input
    let mic = MicInput::default();
    // Stream the audio from the microphone
    let stream = mic.stream().unwrap();
    // Transcribe the audio into text with the default Whisper model
    let mut transcribe = stream.transcribe(Whisper::new().await.unwrap());
    // Print the text as it is streamed in
    transcribe.to_std_out().await.unwrap();
}