# Jupiter Search

[![Crates.io][crates-badge]][crates-url]
[![MIT licensed][mit-badge]][mit-url]
[![APACHE 2 licensed][apache-badge]][apache-url]
[![Build Status][actions-badge]][actions-url]

[crates-badge]: https://img.shields.io/crates/v/podcast2text.svg
[crates-url]: https://crates.io/crates/podcast2text
[mit-badge]: https://img.shields.io/badge/license-MIT-blue.svg
[mit-url]: https://github.com/FlakM/jupiter-search/blob/master/LICENSE-MIT
[apache-badge]: https://img.shields.io/badge/License-Apache_2.0-blue.svg
[apache-url]: https://github.com/FlakM/jupiter-search/blob/master/LICENSE-APACHE
[actions-badge]: https://github.com/flakm/jupiter-search/actions/workflows/build.yml/badge.svg
[actions-url]: https://github.com/FlakM/jupiter-search/actions

A showcase for indexing [Jupiter network](https://www.jupiterbroadcasting.com/) podcasts using [meilisearch](https://www.meilisearch.com/).

This repository was built to provide a possible solution to the following problems:

- [search](https://github.com/JupiterBroadcasting/jupiterbroadcasting.com/issues/26)
- [transcription](https://github.com/JupiterBroadcasting/jupiterbroadcasting.com/issues/301)

**DISCLAIMER!** This is a work-in-progress version that showcases how indexing/transcription might work.

## Overview

The project contains two main modules:

* `podcast2text` - a CLI tool for downloading an RSS feed and transcribing podcast episodes
* `search-load` - a CLI tool for loading the obtained transcriptions into a meilisearch instance

## Building

To build you will need the following packages on your system:

- cargo
- pkg-config
- openssl
- ffmpeg

There is a nix flake configured to ship the build dependencies. Just run `direnv allow` and then:

```shell
git submodule update --init --recursive
cargo build --release
```

To appease the gods of good taste, please add the following pre-commit hook:

```shell
git config --local core.hooksPath .githooks
```

## Usage

### Run downloading podcasts

### Process audio from RSS feed

1. Download the whisper model

```shell
mkdir models
# this might be one of:
# "tiny.en" "tiny" "base.en" "base" "small.en" "small" "medium.en" "medium" "large"
model=medium.en
curl --output models/model.bin https://ggml.ggerganov.com/ggml-model-whisper-$model.bin
```

2. Run the inference on the RSS feed

```shell
# get information about the cli
docker run flakm/podcast2text --help

docker run \
    -v $PWD/models:/data/models \
    flakm/podcast2text \
    rss https://feed.jupiter.zone/allshows
```

### Install meilisearch

```shell
docker pull getmeili/meilisearch:v0.29

docker run -it --rm \
    -p 7700:7700 \
    -e MEILI_MASTER_KEY='MASTER_KEY' \
    -v $(pwd)/meili_data:/meili_data \
    getmeili/meilisearch:v0.29 \
    meilisearch --env="development"
```

### Run index creation and data loading

### Running inference on some audio

1. Download the whisper model

```shell
mkdir models
# this might be one of:
# "tiny.en" "tiny" "base.en" "base" "small.en" "small" "medium.en" "medium" "large"
model=medium.en
curl --output models/ggml-$model.bin https://ggml.ggerganov.com/ggml-model-whisper-$model.bin
```

2. Download the example audio from the RSS feed

```shell
curl https://feed.jupiter.zone/link/19057/15745245/55bb5263-04be-43a3-8b92-678072a9cfc8.mp3 -L -o action.mp3
```

3. Install `ffmpeg` and put it on your `PATH`. Then convert the downloaded mp3 into the 16 kHz mono WAV format that whisper expects, e.g. `ffmpeg -i action.mp3 -ar 16000 -ac 1 -c:a pcm_s16le action_short.wav`.

4. Run the inference example

```shell
cargo run --release --example=get_transcript -- models/ggml-medium.en.bin action_short.wav | tee output.txt
```
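
Once transcriptions have been produced, they can be loaded into the Meilisearch instance started in the "Install meilisearch" section. The repository's `search-load` tool is intended to handle this; as a minimal sketch of the idea using Meilisearch's plain HTTP API, you can push documents and query them with `curl`. The `episodes` index name and the `transcripts.json` file are illustrative assumptions, not the tool's actual interface:

```shell
# load documents into an "episodes" index (created automatically if it does not exist)
curl -X POST 'http://localhost:7700/indexes/episodes/documents' \
    -H 'Authorization: Bearer MASTER_KEY' \
    -H 'Content-Type: application/json' \
    --data-binary @transcripts.json

# search the index once the documents have been processed
curl -X POST 'http://localhost:7700/indexes/episodes/search' \
    -H 'Authorization: Bearer MASTER_KEY' \
    -H 'Content-Type: application/json' \
    --data '{"q": "linux"}'
```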