Crates.io | whisper-overlay |
lib.rs | whisper-overlay |
version | 1.0.0 |
source | src |
created_at | 2024-06-23 12:50:16.10695 |
updated_at | 2024-06-23 12:50:16.10695 |
description | A wayland overlay providing speech-to-text functionality for any application via a global push-to-talk hotkey |
homepage | https://github.com/oddlama/whisper-overlay |
repository | https://github.com/oddlama/embedded-devices |
id | 1281183 |
size | 145,197 |
A wayland overlay providing speech-to-text functionality for any application via a global push-to-talk hotkey. Anything you say while holding the hotkey will be transcribed in real-time and shown on-screen. The live transcriptions use a faster but less accurate model, but as soon as you pause speaking or release the hotkey, the transcription will be updated using a second, more accurate model. The resulting text will then be typed into the window that is currently focused.
Uses layer-shell and virtual-keyboard-v1 to support most wayland compositors.
This makes use of the RealtimeSTT python library to provide live transcriptions, which in turn uses faster-whisper for both the realtime and the high-fidelity transcription models.
Requirements:
Clone the repository
git clone https://github.com/oddlama/whisper-overlay
cd whisper-overlay
Run the realtime-stt-server using docker
docker-compose up
Install and run whisper-overlay
cargo install whisper-overlay
whisper-overlay overlay
# Or alternatively select a hotkey:
#whisper-overlay overlay --hotkey KEY_F12
Now press and hold Right Ctrl to transcribe. For a permanent installation, I recommend starting the server as a systemd service and adding whisper-overlay overlay as a startup command to your desktop environment / compositor.
In principle you just need to start ./realtime-stt-server.py and it will listen for requests on localhost:7007.
You can then start whisper-overlay overlay to transcribe text. The default hotkey is Right Ctrl, but you can change this by specifying any name from evdev::Key, for example KEY_F12 for F12. Beware that the hotkey is only observed, so it will still be passed to the application that is focused.
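Before starting the overlay, it can be useful to confirm that something is actually accepting connections on localhost:7007. The following is a minimal sketch and not part of whisper-overlay: server_reachable is a hypothetical helper name, and it relies on bash's /dev/tcp redirection, so it assumes bash rather than a plain POSIX sh.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with whisper-overlay).
# Succeeds if a TCP connection to host:port can be opened.
server_reachable() {
    local host="${1:-localhost}" port="${2:-7007}"
    # bash-only: opening /dev/tcp/HOST/PORT attempts a TCP connection.
    # The subshell closes the descriptor again as soon as it exits.
    (exec 3<>"/dev/tcp/${host}/${port}") 2>/dev/null
}

if server_reachable localhost 7007; then
    echo "something is accepting connections on localhost:7007"
else
    echo "nothing is accepting connections on localhost:7007"
fi
```

This only checks that the port is open. It does not verify that the process behind it is the realtime-stt-server.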
If you want to change the server settings, it comes with the following options:
> realtime-stt-server.py --help
usage: realtime-stt-server.py [-h] [--host HOST] [--port PORT] [--device DEVICE] [--model MODEL]
[--model-realtime MODEL_REALTIME] [--language LANGUAGE] [--debug]
options:
-h, --help show this help message and exit
--host HOST The host to listen on [default: 'localhost']
--port PORT The port to listen on [default: 7007]
--device DEVICE Device to run the models on, defaults to cuda if available, else cpu [default: 'cuda']
--model MODEL Main model used to generate the final transcription [default: 'large-v3']
--model-realtime MODEL_REALTIME
Faster model used to generate live transcriptions [default: 'base']
--language LANGUAGE Set the spoken language. Leave empty to auto-detect. [default: '']
--debug Enable debug log output [default: unset]
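As an illustration of combining these options, the sketch below starts the server with smaller models on the CPU. The specific values are assumptions chosen for the example (small and tiny are faster-whisper model sizes, en is a language code); only the flag names come from the help output above.

```shell
# Assumed settings for a machine without a CUDA GPU: smaller models on the CPU.
# Only the flag names are taken from the --help output; the values are examples.
STT_ARGS="--device cpu --model small --model-realtime tiny --language en"

if [ -f ./realtime-stt-server.py ]; then
    # Intentionally unquoted so the string splits into separate arguments.
    python ./realtime-stt-server.py $STT_ARGS
else
    echo "not in the whisper-overlay checkout; would run: python ./realtime-stt-server.py $STT_ARGS"
fi
```

Smaller models trade accuracy for speed, so expect the final transcription to be less precise than with the default large-v3.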
The actual overlay can also be customized, for example by providing your own gtk style (refer to the builtin style.css as a reference), or by changing the hotkey. It has the following options:
> whisper-overlay overlay --help
Usage: whisper-overlay overlay [OPTIONS]
Options:
-a, --address <ADDRESS> The address of the the whisper streaming instance (host:port) [default: localhost:7007]
-s, --style <STYLE> An optional stylesheet for the overlay, which replaces the internal style
--hotkey <HOTKEY> Specifies the hotkey to activate voice input. You can use any key or button name from [evdev::Key](https://docs.rs/evdev/latest/evdev/struct.Key.html) [default: KEY_RIGHTCTRL]
-h, --help Print help
For a quick and simple install, you can run the server using docker and install the overlay directly via cargo:
git clone https://github.com/oddlama/whisper-overlay
cd whisper-overlay
# Start realtime-stt-server
docker-compose up
# Install and run overlay
cargo install whisper-overlay
whisper-overlay overlay
This application comes with a NixOS module and overlay so you can easily access the relevant packages and host the realtime-stt-server. First, add this flake as an input:
{
inputs = {
# ...
whisper-overlay.url = "github:oddlama/whisper-overlay";
whisper-overlay.inputs.nixpkgs.follows = "nixpkgs";
};
}
Then add the nixos module exposed by this flake and enable the realtime-stt-server in your configuration.nix. Also add the relevant package to your system or user, so you can start it later.
{
imports = [
inputs.whisper-overlay.nixosModules.default
];
# Also make sure to enable cuda support in nixpkgs, otherwise transcription will
# be painfully slow. But be prepared to let your computer build packages for 2-3 hours.
nixpkgs.config.cudaSupport = true;
services.realtime-stt-server.enable = true;
environment.systemPackages = [pkgs.whisper-overlay];
}
The server will now be started automatically with your system, and you can run whisper-overlay overlay as your user.
You might want to add this.
First, install and start the server:
# Create virtualenv
python -m venv venv
source venv/bin/activate
# Install RealtimeSTT (fork)
# Follow this for GPU support:
# https://github.com/KoljaB/RealtimeSTT?tab=readme-ov-file#gpu-support-with-cuda-recommended
git clone https://github.com/oddlama/RealtimeSTT
cd RealtimeSTT
pip install -r requirements.txt
cd ..
# Run server script
git clone https://github.com/oddlama/whisper-overlay
cd whisper-overlay
python ./realtime-stt-server.py
Second, start the overlay by running the client from source:
# Clone repository (or reuse the previous checkout)
git clone https://github.com/oddlama/whisper-overlay
cd whisper-overlay
cargo build --release
./target/release/whisper-overlay overlay
whisper-overlay provides a waybar-status subcommand to display the server status in your waybar.
Add this to your waybar config:
"custom/whisper_overlay": {
"escape": true,
"exec": "/path/to/whisper-overlay waybar-status",
"format": "{icon} {}",
"format-icons": {
"disconnected": "<span foreground='gray'></span>",
"connected": "<span foreground='#4ab0fa'></span>",
"connected-active": "<span foreground='red'></span>"
},
"return-type": "json",
"tooltip": true
},
And instantiate the module somewhere:
"modules-left": [
// ...
"custom/whisper_overlay"
// ...
],
Currently, you need to use my fork of RealtimeSTT, which allows the client to read token probabilities and fixes some shutdown issues. I have already requested that these changes be upstreamed, so hopefully this won't be required for long.
The provided realtime-stt-server implementation allows you to host the server either locally on your machine or on another machine in your network. Our end of the implementation is technically ready for multiple clients, but due to the way RealtimeSTT works, it cannot process multiple requests simultaneously at this point in time. So you will have to wait for other clients to disconnect before your transcription can begin.
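Hosting the server on another machine could look roughly like the sketch below. The address 192.168.1.50 is a hypothetical placeholder, and binding to 0.0.0.0 (all interfaces) is an assumption about how you want to expose the server; the --host, --port and --address flags are the ones listed in the help output.

```shell
# On the server machine (hypothetical LAN address 192.168.1.50), listen on
# all interfaces instead of only localhost:
#   python ./realtime-stt-server.py --host 0.0.0.0 --port 7007

# On the desktop machine, point the overlay at that server.
STT_SERVER="192.168.1.50:7007"   # placeholder, replace with your server's host:port

if command -v whisper-overlay >/dev/null 2>&1; then
    whisper-overlay overlay --address "$STT_SERVER"
else
    echo "whisper-overlay is not installed; would run: whisper-overlay overlay --address $STT_SERVER"
fi
```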
Currently, this project requires a wayland compositor that supports the layer-shell and virtual-keyboard-v1 protocol extensions. Thus it should work out of the box on any wlroots-based compositor (sway, ...) and on hyprland. X11 support is currently not planned. There is a branch with a partial implementation for X11, but getting GTK4 to create a reliable overlay window has proven to be hard, and auto-type doesn't work properly with enigo (the rust library in use for virtual input). I'm of course happy to accept contributions in that regard if someone knows how to address the remaining issues.
The global hotkey is detected using evdev, since I didn't manage to get the GlobalShortcuts desktop portal to work with windows using the layer-shell protocol (related issue). In the future this might change, but for now your user must be in the input group for this to work.
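A quick way to check the group requirement is sketched below. in_input_group is a hypothetical helper name, and the suggested usermod command assumes a typical Linux setup where the group is literally called input; distributions that manage groups declaratively (such as NixOS) need a different approach.

```shell
# Hypothetical helper: succeeds if the given space-separated list of group
# names contains exactly "input" as one of its entries.
in_input_group() {
    case " $1 " in
        *" input "*) return 0 ;;
        *) return 1 ;;
    esac
}

# id -nG prints the names of all groups of the current user.
if in_input_group "$(id -nG)"; then
    echo "ok: current user is in the input group"
else
    echo "missing: try 'sudo usermod -aG input \$USER', then log out and back in"
fi
```

Group changes only take effect for new login sessions, so the check may keep failing until you log in again.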
Licensed under the MIT license (LICENSE or https://opensource.org/licenses/MIT). Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in this project by you, shall be licensed as above, without any additional terms or conditions.