## 💬 whisper-overlay

A wayland overlay providing speech-to-text functionality for any application via a global push-to-talk hotkey. Anything you say while holding the hotkey is transcribed in real-time and shown on-screen. The live transcription uses a faster but less accurate model; as soon as you pause speaking or release the hotkey, the transcription is updated using a second, more accurate model. The resulting text is then typed into the currently focused window.

- On-screen, realtime live transcriptions via CUDA and faster-whisper
- The server-client based architecture allows you to host the model on another machine
- Native waybar integration for status display
- Utilizes `layer-shell` and `virtual-keyboard-v1` to support most wayland compositors

This makes use of the [RealtimeSTT](https://github.com/KoljaB/RealtimeSTT) python library to provide live transcriptions, which in turn uses [faster-whisper](https://github.com/SYSTRAN/faster-whisper) for both the actual realtime and the high-fidelity transcription model.

Requirements:

- A wayland compositor (sway, hyprland, ...)
- A GPU with CUDA support is highly recommended; otherwise transcription will have significant latency even on a modern CPU (~1 second for live transcription and ~5 seconds for the final result)

## 🚀 Quick Start

- Clone the repository

  ```
  git clone https://github.com/oddlama/whisper-overlay
  cd whisper-overlay
  ```

- Run the realtime-stt-server using docker

  ```
  docker-compose up
  ```

- Install and run whisper-overlay

  ```
  cargo install whisper-overlay
  whisper-overlay overlay
  # Or alternatively select a hotkey:
  #whisper-overlay overlay --hotkey KEY_F12
  ```

Now press and hold Right Ctrl to transcribe. For a permanent installation, I recommend starting the server as a systemd service and adding `whisper-overlay overlay` as a startup command to your desktop environment / compositor.
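For the server half of that recommendation, a systemd user service could be sketched like this. This is only an outline, not a shipped unit file: the unit name and the clone location (`%h/whisper-overlay`) are assumptions, so adjust them to your setup.

```ini
# ~/.config/systemd/user/realtime-stt-server.service
# Sketch under assumed paths -- adjust ExecStart to where you cloned the repo.
[Unit]
Description=realtime-stt-server for whisper-overlay
After=network.target

[Service]
ExecStart=%h/whisper-overlay/realtime-stt-server.py
Restart=on-failure

[Install]
WantedBy=default.target
```

Afterwards, reload and enable it with `systemctl --user daemon-reload` followed by `systemctl --user enable --now realtime-stt-server.service`.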
## ⚙️ Usage

In principle you just need to start `./realtime-stt-server.py` and it will be listening for requests on `localhost:7007`. You can then start `whisper-overlay overlay` to transcribe text. The default hotkey is Right Ctrl, but you can change this by specifying any name from [evdev::Key](https://docs.rs/evdev/latest/evdev/struct.Key.html), for example `KEY_F12` for F12. Beware that the hotkey is only observed and will still be passed to the application that is focused.

#### Server (realtime-stt-server)

If you want to change the server settings, it comes with the following options:

```bash
> realtime-stt-server.py --help
usage: realtime-stt-server.py [-h] [--host HOST] [--port PORT] [--device DEVICE] [--model MODEL]
                              [--model-realtime MODEL_REALTIME] [--language LANGUAGE] [--debug]

options:
  -h, --help            show this help message and exit
  --host HOST           The host to listen on [default: 'localhost']
  --port PORT           The port to listen on [default: 7007]
  --device DEVICE       Device to run the models on, defaults to cuda if available, else cpu [default: 'cuda']
  --model MODEL         Main model used to generate the final transcription [default: 'large-v3']
  --model-realtime MODEL_REALTIME
                        Faster model used to generate live transcriptions [default: 'base']
  --language LANGUAGE   Set the spoken language. Leave empty to auto-detect. [default: '']
  --debug               Enable debug log output [default: unset]
```

#### Client (whisper-overlay)

The actual overlay can also be customized, for example by providing your own gtk style (refer to [the builtin style.css](./src/style.css) as a reference), or by changing the hotkey. It has the following options:

```bash
> whisper-overlay overlay --help
Usage: whisper-overlay overlay [OPTIONS]

Options:
  -a, --address
          The address of the whisper streaming instance (host:port) [default: localhost:7007]
  -s, --style
```
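Since `--host` and `--address` are the knobs for the server-client split, hosting the models on a separate machine could look like the following. This is a sketch: `gpu-box` is a placeholder hostname, not a real default, and the flags used are the ones documented in the help output above.

```shell
# On the machine that hosts the models: listen on all interfaces
# instead of only localhost, so other machines can connect.
./realtime-stt-server.py --host 0.0.0.0 --port 7007

# On the desktop running the wayland compositor: point the overlay
# at that machine ("gpu-box" is an assumed hostname).
whisper-overlay overlay --address gpu-box:7007
```

Note that the server does not authenticate clients, so only expose it on networks you trust.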