| Crates.io | awful_book_sanitizer |
| lib.rs | awful_book_sanitizer |
| version | 0.1.3 |
| created_at | 2025-10-07 02:48:14.089882+00 |
| updated_at | 2025-10-07 02:52:30.539083+00 |
| description | CLI to clean up OCR-mangled book excerpts into readable text using OpenAI-compatible APIs. |
| homepage | https://github.com/graves/awful_book_sanitizer |
| repository | https://github.com/graves/awful_book_sanitizer |
| id | 1871169 |
| size | 159,198 |
A Rust program that leverages OpenAI-compatible APIs to turn OCR-mangled books into readable, sane text. Because nobody wants to read the literal results of a neural network.
o . _ .
. (_) o
o ____ _ o
_ ,-/ /))) . o (_) .
(_) \_\ ( e( O _
o \/' _/ ,_ , o o (_)
. O _/ (_ / _/ . , o
o8o/ \\_/ / ,-. ,oO8/( -TT
o8o8O | } } / / \Oo8OOo8Oo|| O
Oo(""o8"""""""""""""""8oo""""""")
_ `\`' `' /' o
(_) \ / _ .
O \ _ / (_)
o . `-. .----<(o)_--. .-'
--------(_/------(_<_/--\_)--------hjw
λ awful_book_sanitizer --help
Clean up excerpts from books formatted as txt
Usage: awful_book_sanitizer [OPTIONS] --input <INPUT_DIR> --output <OUTPUT_DIR>
Options:
-i, --input <INPUT_DIR> Path to directory of txt files
-o, --output <OUTPUT_DIR> Path to directory where yaml files will be written
--config <CONFIG>... Configuration files (can specify multiple)
-h, --help Print help
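The `--input` option above takes a directory of txt files. As a rough illustration of that first step, here is a minimal sketch of gathering them using only the standard library. The function name `collect_txt_files` is hypothetical, and the choices to skip subdirectories and match the extension case-insensitively are assumptions; the crate's actual implementation may differ.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical helper (not taken from the crate): list the `.txt` files
/// directly inside `input_dir`, sorted so repeated runs see the same order.
fn collect_txt_files(input_dir: &Path) -> io::Result<Vec<PathBuf>> {
    let mut files = Vec::new();
    for entry in fs::read_dir(input_dir)? {
        let path = entry?.path();
        // Assumption: match the extension case-insensitively ("txt", "TXT").
        let is_txt = path
            .extension()
            .and_then(|ext| ext.to_str())
            .map_or(false, |ext| ext.eq_ignore_ascii_case("txt"));
        // Assumption: only regular files count; subdirectories are skipped.
        if path.is_file() && is_txt {
            files.push(path);
        }
    }
    files.sort();
    Ok(files)
}
```

Calling `collect_txt_files(Path::new("ocr_books"))` would return the excerpts to process, or an `io::Error` if the directory cannot be read.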
This is awful_book_sanitizer, a command-line tool designed to clean up text excerpts from books that were too spooky for OCR.
Key features:
Despite its ominous name, it's actually pretty awesome.
.txt files (probably from OCR).
Think of it as a magical wand that turns "This is a really bad word" into "That's actually the correct spelling."
awful_book_sanitizer -i /path/to/ocr-books -o /path/to/output --config llama-cpp-config.yaml google-colab-config.yaml
What this does:
-i = input directory (e.g., "ocr_books").
-o = output directory (will create YAML files like book1.yaml).
--config = one or more configuration files for different LLMs.
Bonus tip: If your books are "too spooky," try setting RUST_LOG=info in the environment to see the program's internal monologue.
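To make the input-to-output naming concrete, here is a small sketch of how a txt file in the input directory could map to a YAML file in the output directory. The helper `yaml_output_path` is hypothetical, and the rule "same file stem, `.yaml` extension" is an assumption drawn from the `book1.yaml` example, not confirmed from the crate's source.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical helper (not taken from the crate): map an input file such as
/// `ocr_books/book1.txt` to `<output_dir>/book1.yaml`.
/// Returns `None` when the input path has no file name component.
fn yaml_output_path(input_file: &Path, output_dir: &Path) -> Option<PathBuf> {
    let stem = input_file.file_stem()?;
    // Append ".yaml" to the stem rather than calling `set_extension`,
    // so a stem containing dots (e.g. "book.v2") is kept whole.
    let mut name = stem.to_os_string();
    name.push(".yaml");
    Some(output_dir.join(name))
}
```

Under these assumptions, `yaml_output_path(Path::new("ocr_books/book1.txt"), Path::new("output"))` yields `output/book1.yaml`.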
This is not just a sanitizer—it's a high-stakes collaboration between humans and AI.
Just remember: The goal isn’t to fix every typo. It’s to make sure your books are at least legible.
tokio, serde, and clap.
llama-cpp @ localhost or Google Colab).
The hardest part about finetuning an LLM is managing all of the input and output files. 🤔
You’re welcome to email me with creative solutions.
Install the tool:
cargo install awful_book_sanitizer
Run it:
awful_book_sanitizer -i books -o output --config configs.yaml
Check your YAML files:
cat output/book1.yaml
Please try to refrain from creating accidental heresy.
This program is a love letter to bad OCR and the power of LLMs. It’s not perfect, but it’s a step toward sanity for your books.
Remember: The goal isn’t to make the text perfect—it’s to make it useful enough.
Now go forth and sanitize! 🧪📚