awful_dataset_builder

version: 0.1.3
created_at: 2025-10-07 02:46:06.813977+00
updated_at: 2025-10-07 02:50:56.579134+00
description: Build LLM-ready Q/A datasets from reference text-to-question mappings produced by Awful Knowledge Synthesizer.
homepage: https://github.com/graves/awful_dataset_builder
repository: https://github.com/graves/awful_dataset_builder
id: 1871168
size: 158,056
Thomas Gentry (graves)

documentation: https://docs.rs/awful_dataset_builder

README

🏗️ Awful Dataset Builder: Turn Reference Text/Exam Question mappings into Question/Answer pairs! 📚

“Turn your study notes into interrogation scripts for robots.”

                                           __
                               _____....--' .'
                     ___...---'._ o      -`(
           ___...---'            \   .--.  `\
 ___...---'                      |   \   \ `|
|                                |o o |  |  |
|                                 \___'.-`.  '.
|                                      |   `---'
'^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^' LGB
λ awful_dataset_builder --help
Generate final exam questions from YAML book chunks

Usage: awful_dataset_builder --dir <DIR> --config <CONFIG> --start <START> --source-type <SOURCE_TYPE>

Options:
  -d, --dir <DIR>                  Path to directory of .yaml book files
  -c, --config <CONFIG>            Configuration file
  -s, --start <START>              Start processing file from this chunk
      --source-type <SOURCE_TYPE>  Source type [possible values: book, manpage, mdbook, tealdeer, code]
  -h, --help                       Print help

🧠 What It Does

awful_dataset_builder is a command-line tool that takes structured YAML files (from awful_knowledge_synthesizer) and generates question-answer pairs using Large Language Models (LLMs). It's your go-to tool for building datasets for finetuning LLMs.


🎯 Features

  • ✅ Multi-source support: Books, manpages, mdbooks, tealdeer (command-line snippets), and code files can be turned into exam questions by awful_knowledge_synthesizer.
  • 🧠 LLM-powered QA pairs: Fetches answers for final exam questions using awful_aj.
  • 📄 YAML output: Saves results as structured YAML files (e.g., math_questions.yaml).
  • 🔄 Chunked processing: Splits text into chunks for robust LLM queries.

📦 How to Use

🔧 Sample Command

λ awful_dataset_builder --dir ./books --config config.yaml --source-type book --start 1
  • --dir: Path to the directory of YAML files (e.g., books/).
  • --config: Configuration file for the LLM API (OpenAI, etc.).
  • --source-type: One of book, manpage, mdbook, tealdeer, or code.
  • --start: Chunk to start processing from (useful for parallel processing).
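
To make the accepted --source-type values concrete, here is a minimal std-only sketch of how they could map onto an enum. The tool itself uses clap for argument parsing, so this is not its actual code; the `SourceType` name and the case-sensitive matching are assumptions based on the values printed by --help.

```rust
use std::str::FromStr;

/// The five source types listed by `--help`.
/// The enum name and variants are illustrative, not taken from the crate.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SourceType {
    Book,
    Manpage,
    Mdbook,
    Tealdeer,
    Code,
}

impl FromStr for SourceType {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        // Assumption: values are matched exactly as printed by `--help`
        // (lowercase, case-sensitive).
        match s {
            "book" => Ok(SourceType::Book),
            "manpage" => Ok(SourceType::Manpage),
            "mdbook" => Ok(SourceType::Mdbook),
            "tealdeer" => Ok(SourceType::Tealdeer),
            "code" => Ok(SourceType::Code),
            other => Err(format!(
                "invalid source type '{other}' (expected book, manpage, mdbook, tealdeer, or code)"
            )),
        }
    }
}
```

With this sketch, `"book".parse::<SourceType>()` succeeds, while an unlisted or differently cased value returns an error that names the accepted values.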

📄 Example Output

For a book YAML file:

title: "Calculus for Dummies"
chunks:
  - "What is the derivative of f(x) = x²?"

Output:

- prompt: "Here is some reference text:\n\nWhat is the derivative of f(x) = x²?"
  answer: "The derivative of $ f(x) = x^2 $ is $ 2x $."
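
The shape of one output record can be sketched in a few lines of std-only Rust. The prompt prefix is copied from the example above; the `QaPair` struct, the function names, and the hand-rolled quoting are hypothetical, and the real tool may use different templates per source type and a YAML library for serialization.

```rust
/// One question/answer record, mirroring the `prompt` and `answer` keys
/// in the example output. The struct name is illustrative.
#[derive(Debug, Clone, PartialEq)]
pub struct QaPair {
    pub prompt: String,
    pub answer: String,
}

/// Prefix copied from the example output; the real templates may differ.
const REFERENCE_PREFIX: &str = "Here is some reference text:\n\n";

/// Prepend the reference-text prefix to a chunk.
pub fn build_prompt(chunk: &str) -> String {
    format!("{REFERENCE_PREFIX}{chunk}")
}

/// Render a pair as a YAML list item with double-quoted scalars.
pub fn to_yaml_item(pair: &QaPair) -> String {
    format!(
        "- prompt: {}\n  answer: {}\n",
        quote(&pair.prompt),
        quote(&pair.answer)
    )
}

/// Minimal double-quoted escaping: backslash, double quote, and newline only.
fn quote(s: &str) -> String {
    let mut out = String::with_capacity(s.len() + 2);
    out.push('"');
    for c in s.chars() {
        match c {
            '\\' => out.push_str("\\\\"),
            '"' => out.push_str("\\\""),
            '\n' => out.push_str("\\n"),
            other => out.push(other),
        }
    }
    out.push('"');
    out
}
```

The escaping here covers only the characters that appear in the example; a production serializer would need to handle the full YAML escape set.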

🤓 How It Works

  1. Parse YAML: Extracts structured Reference Text to Final Exam Question mappings.
  2. LLM Query: Uses templates to generate questions and fetch answers via awful_aj.
  3. Output: Saves QA pairs in YAML format (e.g., math_questions.yaml).
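
The three steps above can be sketched as a single loop. This is an illustration under stated assumptions, not the crate's implementation: `process_chunks` is a hypothetical name, the `ask` closure stands in for the awful_aj call, `start` is treated as a 0-based chunk index, and stopping at the first failure (so a run could be resumed with --start) is a design choice of this sketch.

```rust
/// Walk the chunks from `start` onward, build a prompt for each, and collect
/// (prompt, answer) pairs. On the first failure, return the index of the
/// failing chunk together with the error message.
///
/// Assumptions: `start` is a 0-based index, and `ask` is a placeholder for
/// whatever client actually queries the LLM.
pub fn process_chunks<F>(
    chunks: &[String],
    start: usize,
    mut ask: F,
) -> Result<Vec<(String, String)>, (usize, String)>
where
    F: FnMut(&str) -> Result<String, String>,
{
    let mut pairs = Vec::new();
    for (index, chunk) in chunks.iter().enumerate().skip(start) {
        // Prompt prefix taken from the README's example output.
        let prompt = format!("Here is some reference text:\n\n{chunk}");
        match ask(&prompt) {
            Ok(answer) => pairs.push((prompt, answer)),
            Err(reason) => return Err((index, reason)),
        }
    }
    Ok(pairs)
}
```

Passing a closure keeps the sketch testable without network access: a test can supply canned answers, while a real caller would wrap its LLM client.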

📚 Supported Sources

Source Type   Description
Book          YAML files with questions generated from book excerpts (e.g., "Title: Math for Dummies")
Manpage       YAML files with questions generated from manpage excerpts
Mdbook        YAML files with questions generated from Markdown excerpts (mdbook-built documentation)
Tealdeer      YAML files with questions generated from command-line snippets (tldr commands)
Code          YAML files with questions generated from C, Rust, or Assembly source code repositories (language-aware tokenization)

🧪 Implementation Notes

  • Uses clap for CLI parsing.
  • Relies on serde, tokio, and regex.
  • LLM queries are handled with exponential backoff (MAX_RETRIES = 5).
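
For readers unfamiliar with the retry pattern in the last bullet, here is a minimal synchronous sketch of exponential backoff. The crate is built on tokio and presumably retries asynchronously, so this is not its code: only the MAX_RETRIES = 5 value comes from the notes above, while the base delay, the doubling factor, and reading MAX_RETRIES as the total number of attempts are assumptions.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Retry limit quoted in the implementation notes.
/// Assumption: interpreted here as the total number of attempts.
const MAX_RETRIES: u32 = 5;

/// Delay before the retry that follows failed attempt `attempt` (0-based):
/// base * 2^attempt. The doubling factor is an assumption.
pub fn backoff_delay(base: Duration, attempt: u32) -> Duration {
    base * 2u32.pow(attempt)
}

/// Call `op` until it succeeds or MAX_RETRIES attempts have failed,
/// sleeping for an exponentially growing delay between attempts.
pub fn retry_with_backoff<T, E, F>(base: Duration, mut op: F) -> Result<T, E>
where
    F: FnMut() -> Result<T, E>,
{
    let mut attempt = 0;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            // Out of attempts: surface the last error to the caller.
            Err(err) if attempt + 1 >= MAX_RETRIES => return Err(err),
            Err(_) => {
                sleep(backoff_delay(base, attempt));
                attempt += 1;
            }
        }
    }
}
```

With a 100 ms base, the waits between the five attempts would be 100, 200, 400, and 800 ms. An async version would swap `std::thread::sleep` for the runtime's own sleep so the executor is not blocked.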

โค๏ธ Contributing

  • Report bugs or suggest improvements via GitHub Issues.
  • Fork and extend to support new source types!

✨ Final Thoughts

Building datasets is the most difficult, time-consuming labor involved in the Synthetic Finetuning of LLMs. A well-thought-out workflow using Awful Book Sanitizer, Awful Knowledge Synthesizer, and Awful Dataset Builder will allow you to experiment with your wildest curiosities about human language, on the cutting edge of technological advancement, for as long as written language exists 🎉

You can find open-source datasets I've generated using these tools on Hugging Face.
