| Field | Value |
|---|---|
| Crates.io | awful_dataset_builder |
| lib.rs | awful_dataset_builder |
| version | 0.1.3 |
| created_at | 2025-10-07 02:46:06.813977+00 |
| updated_at | 2025-10-07 02:50:56.579134+00 |
| description | Build LLM-ready Q/A datasets from reference text-to-question mappings produced by Awful Knowledge Synthesizer. |
| homepage | https://github.com/graves/awful_dataset_builder |
| repository | https://github.com/graves/awful_dataset_builder |
| id | 1871168 |
| size | 158,056 |
“Turn your study notes into interrogation scripts for robots.”
__
_____....--' .'
___...---'._ o -`(
___...---' \ .--. `\
___...---' | \ \ `|
| |o o | | |
| \___'.-`. '.
| | `---'
'^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^' LGB
λ awful_dataset_builder --help
Generate final exam questions from YAML book chunks
Usage: awful_dataset_builder --dir <DIR> --config <CONFIG> --start <START> --source-type <SOURCE_TYPE>
Options:
-d, --dir <DIR> Path to directory of .yaml book files
-c, --config <CONFIG> Configuration file
-s, --start <START> Start processing file from this chunk
--source-type <SOURCE_TYPE> Source type [possible values: book, manpage, mdbook, tealdeer, code]
-h, --help Print help
awful_dataset_builder is a command-line tool that takes structured YAML files (from awful_knowledge_synthesizer) and generates question-answer pairs using Large Language Models (LLMs). It's your go-to tool for building datasets for finetuning LLMs.
awful_aj.math_questions.yaml).

λ awful_dataset_builder --dir ./books --config config.yaml --source-type book --start 1
- `--dir`: Path to the directory of YAML files (e.g., `books/`).
- `--config`: Configuration file for the LLM API (OpenAI, etc.).
- `--source-type`: One of `book`, `manpage`, `mdbook`, `tealdeer`, or `code`.
- `--start`: Start processing from this chunk, skipping earlier ones (useful for parallel processing).

For a book YAML file:
title: "Calculus for Dummies"
chunks:
- "What is the derivative of f(x) = x²?"
Output:
- prompt: "Here is some reference text:\n\nWhat is the derivative of f(x) = x²?"
  answer: "The derivative of $ f(x) = x^2 $ is $ 2x $."
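The mapping from a chunk to a `prompt`/`answer` record in the output above can be sketched in a few lines of Rust. This is a minimal illustration rather than the crate's actual code: the `QaPair` struct, `build_prompt`, and `build_pairs` are hypothetical names, and `answer_fn` is a stand-in for the real LLM call. Only the framing sentence "Here is some reference text:" comes from the example output.

```rust
/// One generated record, mirroring the `prompt` and `answer` keys in the
/// example output. The struct name is hypothetical, not taken from the crate.
#[derive(Debug, Clone, PartialEq)]
pub struct QaPair {
    pub prompt: String,
    pub answer: String,
}

/// Prefix a chunk with the framing sentence seen in the example output.
pub fn build_prompt(chunk: &str) -> String {
    format!("Here is some reference text:\n\n{}", chunk)
}

/// Pair each chunk's prompt with an answer produced by `answer_fn`.
/// `answer_fn` is an assumption standing in for the LLM request.
pub fn build_pairs<F>(chunks: &[&str], mut answer_fn: F) -> Vec<QaPair>
where
    F: FnMut(&str) -> String,
{
    chunks
        .iter()
        .map(|chunk| {
            let prompt = build_prompt(chunk);
            let answer = answer_fn(&prompt);
            QaPair { prompt, answer }
        })
        .collect()
}
```

Passing the answer source as a closure keeps the sketch testable without network access; in practice the closure would wrap whatever client the configuration file points at.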
awful_aj.math_questions.yaml).

| Source Type | Description |
|---|---|
| Book | YAML files with questions generated from book excerpts (e.g., "Title: Math for Dummies") |
| Manpage | YAML files with questions generated from manpage excerpts |
| Mdbook | YAML files with questions generated from Markdown excerpts (mdBook-built documentation) |
| Tealdeer | YAML files with questions generated from command-line snippets (tldr commands) |
| Code | YAML files with questions generated from C, Rust, or Assembly source code repositories (language-aware tokenization) |
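The five source types in the table map naturally onto an enum. The sketch below shows one way to parse the `--source-type` value using only the standard library. It is an illustration under stated assumptions, not the crate's implementation (which uses clap for argument parsing): the `SourceType` enum and its error message are hypothetical, and the accepted spellings are the lowercase values listed in the `--help` output.

```rust
use std::str::FromStr;

/// The five source types from the table. The enum itself is illustrative.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SourceType {
    Book,
    Manpage,
    Mdbook,
    Tealdeer,
    Code,
}

impl FromStr for SourceType {
    type Err = String;

    /// Accept exactly the lowercase values printed by `--help`.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "book" => Ok(SourceType::Book),
            "manpage" => Ok(SourceType::Manpage),
            "mdbook" => Ok(SourceType::Mdbook),
            "tealdeer" => Ok(SourceType::Tealdeer),
            "code" => Ok(SourceType::Code),
            // Hypothetical error text; the real CLI reports its own message.
            other => Err(format!("unknown source type: {}", other)),
        }
    }
}
```

With this in place, `"tealdeer".parse::<SourceType>()` yields `Ok(SourceType::Tealdeer)`, while a capitalized `"Book"` is rejected, which matches the case-sensitive values shown in the help text.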
- clap for CLI parsing.
- serde, tokio, and regex.
- A retry limit (`MAX_RETRIES = 5`).

Building datasets is the most difficult, time-consuming labor involved in the synthetic finetuning of LLMs. A well-thought-out workflow using Awful Book Sanitizer, Awful Knowledge Synthesizer, and Awful Dataset Builder will allow you to experiment with your wildest curiosities about human language, on the cutting edge of technological advancement, for as long as written language exists.
You can find open-source datasets I've generated using these tools on Hugging Face.