| Crates.io | sensevoice-cli |
| lib.rs | sensevoice-cli |
| version | 0.1.9 |
| created_at | 2025-11-04 09:02:09.752747+00 |
| updated_at | 2025-11-11 08:12:49.739943+00 |
| description | cli tool for speech recognition using sensevoice-small, by restsend.com |
| homepage | |
| repository | https://github.com/restsend/sensevoice-cli |
| max_upload_size | |
| id | 1915984 |
| size | 547,914 |
A lightweight command-line front end for the SenseVoice multilingual speech recognition model.
Linux (Debian/Ubuntu):
apt-get install -y cmake pkg-config
macOS:
brew install cmake
cargo install sensevoice-cli
# or, without Opus (.ogg) support
cargo install sensevoice-cli --no-default-features
SenseVoice Rust CLI (ORT + Symphonia + HF Hub)
Usage: sensevoice-cli [OPTIONS] [AUDIO]
Arguments:
[AUDIO] Input audio file (wav/mp3/ogg/flac/opus/vorbis)
Options:
--models-path <MODELS_PATH> Download/cache directory for models and resources [default: ~/.sensevoice-models]
-t, --threads <NUM_THREADS> Intra-op threads for ONNX Runtime [default: 1]
-l, --language <LANGUAGE> Language code: auto, zh, en, yue, ja, ko, nospeech [default: auto]
--use-itn Use ITN post-processing
--vad-int8 Use int8 Silero VAD model
--no-vad Disable Silero VAD segmentation
--vad-threshold <VAD_THRESHOLD> VAD probability threshold (0.0-1.0) [default: 0.5]
--vad-min-speech-ms <VAD_MIN_SPEECH_MS>
Minimum speech duration in milliseconds [default: 400]
--vad-min-silence-ms <VAD_MIN_SILENCE_MS>
Minimum silence duration in milliseconds [default: 200]
--vad-speech-pad-ms <VAD_SPEECH_PAD_MS>
Additional padding in milliseconds around segments [default: 120]
--vad-merge-gap-ms <VAD_MERGE_GAP_MS>
Merge adjacent segments separated by <= gap milliseconds [default: 1200]
--hf-endpoint <HF_ENDPOINT> Optional HF endpoint/mirror (overrides env HF_ENDPOINT/HF_MIRROR)
--log <LOG> Log level
-o, --output <OUTPUT> Output JSON file path
-c, --channels <CHANNELS> Maximum number of audio channels to transcribe (0 = all) [default: 1]
--download-only Download models only and exit
-h, --help Print help
-V, --version Print version
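The VAD options above interact, so a small sketch may help when choosing values. The Python below is an illustration only, not the tool's implementation: the function name `postprocess_segments`, the millisecond tuple format, and the order of the steps (filter, pad, merge) are all assumptions. It shows how values like `--vad-min-speech-ms`, `--vad-speech-pad-ms` and `--vad-merge-gap-ms` could shape the final segment list.

```python
def postprocess_segments(segments, min_speech_ms=400, speech_pad_ms=120,
                         merge_gap_ms=1200, total_ms=None):
    """Hypothetical post-processing of raw VAD segments.

    `segments` is a list of (start_ms, end_ms) tuples sorted by start time.
    The defaults mirror the CLI defaults listed in the help text; the step
    order is an assumption made for illustration.
    """
    # 1. Drop segments shorter than the minimum speech duration.
    kept = [(s, e) for s, e in segments if e - s >= min_speech_ms]

    # 2. Pad each segment on both sides, clamped to the audio bounds.
    padded = []
    for s, e in kept:
        s = max(0, s - speech_pad_ms)
        e = e + speech_pad_ms
        if total_ms is not None:
            e = min(total_ms, e)
        padded.append((s, e))

    # 3. Merge neighbours separated by <= merge_gap_ms.
    merged = []
    for s, e in padded:
        if merged and s - merged[-1][1] <= merge_gap_ms:
            merged[-1] = (merged[-1][0], max(merged[-1][1], e))
        else:
            merged.append((s, e))
    return merged


if __name__ == "__main__":
    raw = [(1000, 1200), (2000, 3000), (3500, 4000), (8000, 9000)]
    print(postprocess_segments(raw, total_ms=9050))
```

In this sketch a larger merge gap yields fewer, longer segments, while a larger minimum speech duration discards short bursts before they can be padded or merged.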
sensevoice-cli path/to/audio.wav
sensevoice-cli -o transcript.json path/to/audio.wav
Output:
[
  {
    "channel": 0,
    "duration_sec": 7.152,
    "rtf": 0.019359846,
    "segments": [
      {
        "start_sec": 1.09,
        "end_sec": 3.614,
        "text": "THE DRIBL TEETHIN CALLD FOR THE BOY",
        "tags": []
      },
      {
        "start_sec": 3.842,
        "end_sec": 6.59,
        "text": "AND PRESENTED HIM WITH FIFTY PIECES OF COATD",
        "tags": []
      }
    ]
  }
]
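The JSON can be consumed from any language. As a minimal sketch, the Python below loads the sample shown above and prints one line per segment. The helper names (`format_timestamp`, `summarize`, `processing_seconds`) are hypothetical, and the last one assumes `rtf` follows the usual real-time-factor convention of processing time divided by audio duration, which this README does not state.

```python
import json

# The sample output from this README, embedded so the sketch is self-contained.
SAMPLE = """
[
  {
    "channel": 0,
    "duration_sec": 7.152,
    "rtf": 0.019359846,
    "segments": [
      {"start_sec": 1.09, "end_sec": 3.614,
       "text": "THE DRIBL TEETHIN CALLD FOR THE BOY", "tags": []},
      {"start_sec": 3.842, "end_sec": 6.59,
       "text": "AND PRESENTED HIM WITH FIFTY PIECES OF COATD", "tags": []}
    ]
  }
]
"""


def format_timestamp(seconds):
    """Render seconds as MM:SS.mmm."""
    minutes, secs = divmod(seconds, 60)
    return f"{int(minutes):02d}:{secs:06.3f}"


def summarize(results):
    """Return one human-readable line per segment, across all channels."""
    lines = []
    for channel in results:
        for seg in channel["segments"]:
            start = format_timestamp(seg["start_sec"])
            end = format_timestamp(seg["end_sec"])
            lines.append(f'[ch{channel["channel"]} {start} - {end}] {seg["text"]}')
    return lines


def processing_seconds(channel):
    """Estimate wall-clock processing time, assuming rtf = time / duration."""
    return channel["rtf"] * channel["duration_sec"]


if __name__ == "__main__":
    results = json.loads(SAMPLE)
    for line in summarize(results):
        print(line)
```

To read a file written with `-o transcript.json`, replace `json.loads(SAMPLE)` with `json.load(open("transcript.json", encoding="utf-8"))`.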
Models are downloaded to ~/.sensevoice-models on first run (override with --models-path).
sensevoice-cli -l zh --use-itn -c 2 samples/demo.wav
-l/--language: explicit language hint (auto, zh, en, yue, ja, ko, nospeech).
--use-itn: enable inverse text normalization for cleaner numbers and dates.
-c/--channels: limit the number of channels to transcribe (default 1, set 0 for all).
-o/--output: write JSON to a file instead of stdout.
--log: set log verbosity (e.g. info, debug).
--download-only: prefetch model assets without running inference.
--no-vad: bypass voice activity detection and transcribe each channel as a whole.
--vad-*: tune Silero VAD behaviour (threshold, speech/silence durations, padding, merge gap) without editing code.

Use --hf-endpoint https://hf-mirror.com (or set HF_ENDPOINT/HF_MIRROR) to speed up model fetches from mainland China.
Pass --vad-int8 to prefer the quantized Silero VAD model when CPU resources are limited.
Adjust segmentation with the --vad-* flags (threshold, speech/silence durations, padding, merge gap).
Set -t/--threads to match available CPU cores. GPU execution currently requires rebuilding with CUDA-enabled ONNX Runtime.
The first run caches .ort graphs next to the downloaded models; later runs reuse them to avoid ONNX Runtime re-optimization costs.
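For batch jobs it can be convenient to drive the CLI from a script. The sketch below is one possible wrapper, assuming `sensevoice-cli` is on `PATH`; the function names `build_command` and `transcribe` are hypothetical. It uses only flags listed in the help text and writes the result through `-o` to a temporary file, so any log output on stdout cannot interfere with JSON parsing.

```python
import json
import os
import subprocess
import tempfile


def build_command(audio, language="auto", use_itn=False, channels=1,
                  threads=1, output=None, binary="sensevoice-cli"):
    """Assemble an argument list from flags documented in the help text."""
    cmd = [binary, "-l", language, "-c", str(channels), "-t", str(threads)]
    if use_itn:
        cmd.append("--use-itn")
    if output is not None:
        cmd += ["-o", output]
    cmd.append(audio)
    return cmd


def transcribe(audio, **options):
    """Run the CLI and return the parsed JSON result (a list of channels)."""
    with tempfile.TemporaryDirectory() as tmp:
        out_path = os.path.join(tmp, "transcript.json")
        cmd = build_command(audio, output=out_path, **options)
        # check=True raises CalledProcessError if the CLI exits non-zero.
        subprocess.run(cmd, check=True)
        with open(out_path, encoding="utf-8") as fh:
            return json.load(fh)


if __name__ == "__main__":
    # Equivalent to: sensevoice-cli -l zh --use-itn -c 2 samples/demo.wav
    print(build_command("samples/demo.wav", language="zh",
                        use_itn=True, channels=2))
```

Calling `transcribe("samples/demo.wav", language="zh", use_itn=True, channels=2)` would then return the same structure as the JSON output shown earlier, provided the binary is installed and the models are available.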