# sttx
Utility belt for transforming speech-to-text data. Pronounced "sticks".
## Use cases
### Working with `whisper.cpp` output
[`whisper.cpp`](https://github.com/ggerganov/whisper.cpp) is a fantastic piece
of software offering state-of-the-art speech-to-text capability. It is a fairly
low-level program, and its output is not fully configurable. Given an audio
file as input, it can produce text in CSV, SRT, or plain text formats,
including timestamps.
The resolution of the data it gives you is controllable via the max length flag
(`-ml`). Note that the unit of length is tokens unless the split on word flag
(`-sow`) is enabled.
`whisper.cpp -i -ml 1 -sow`
However, at best this only allows us to constrain the output by accumulating
chunks of N words.
`sttx` stakes its utility on the notion that even with no other additional
context, one can transform timestamped STT data into more useful
representations.
At its core, it offers stackable strategies for reducing a sequence of
timestamped speech events to a single event, taking advantage of the fact that
these events (start/end timestamps and text content) are a semigroup (e.g.
something that can be added to itself).
The strategies include:
- `--sentences`: Until the next sentence ending
- `--lasting`: Concatenating until a certain duration has been reached
- `--max-silence`: Until the summed total duration of the gaps in events exceeds the given amount
- `--by-gap`: Until the gap between this event and the next one exceeds the given amount
- `--min-word-count`: Until the total word count of the result exceeds the given figure
- `--chunk-size`: The next N events
For example, if you have a sequence of events like this:
```csv
start,end,text
0,1000, Hel
1000,1100,lo
1100,2000, world
2000,2000,!
2500,3000, How
3100,3500, are
4100,5000, you
5000,5000,?
6300,6700, I'm
6800,7200, fine
7200,7200,","
7300,7500, thanks
7500,7500,!
```
By default, `sttx` combines events without leading whitespace to the previous
event. So with no arguments, the expected output would be:
```csv
start,end,text
0,1100, Hello
1100,2000, world!
2500,3000, How
3100,3500, are
4100,5000, you?
6300,6700, I'm
6800,7200," fine,"
7300,7500, thanks!
```
With the `--sentences` flag, the output would be:
```csv
start,end,text
0,2000, Hello world!
2500,5000, How are you?
6300,7500," I'm fine, thanks!"
```
And `--sentences --chunk-size 2` gives you:
```csv
start,end,text
0,5000, Hello world! How are you?
6300,7500," I'm fine, thanks!"
```
Other output formats are supported:
`--format json`:
```json
[
{
"start": 0,
"end": 5000,
"text": " Hello world! How are you?"
},
{
"start": 6300,
"end": 7500,
"text": " I'm fine, thanks!"
}
]
```
`--format srt`:
```srt
1
00:00:00,000 --> 00:00:05,000
Hello world! How are you?
2
00:00:06,300 --> 00:00:07,500
I'm fine, thanks!
```
## Usage
```txt
Usage: sttx transform [OPTIONS]
Arguments:
Options:
-i, --input-format
[default: csv-fix]
Possible values:
- csv-fix: same as csv, plus whisper.cpp formatting fix
- csv
- json
-f, --format
[default: pretty]
[possible values: csv, json, srt, pretty]
-o, --output
The path to which the program should write the output. Use `-` for stdout
[default: -]
--max-silence
Concatenates until the accumulated delay between events exceeds the given duration
-s, --sentences
Concatenates up to the next sentence ending ('.', '!', or '?')
-w, --min-word-count
Concatenates until the total word count of the result exceeds the given value
-g, --by-gap
Concatenates until the delay until the start of the next event exceeds the given duration
-l, --lasting
Concatenates until the total duration of the result exceeds the given value
-c, --chunk-size
Concatenates up to N events
-h, --help
Print help (see a summary with '-h')
```
As of this writing (2024-03-25), `whisper.cpp`'s CSV output does not appear to
escape double quotes correctly. This finding may be my own error, but if not
I'll file an issue.