Spec-Driven Data Science (specds)

A high-performance, enterprise-grade CLI tool written in Rust that generates complete, tested data science analysis pipelines from user specifications using LLMs.

Overview

This tool streamlines the process of creating boilerplate data analysis code. Instead of writing scripts from scratch, a user provides a high-level specification, and the tool generates the corresponding code in Python or SQL, complete with functions, tests, and best practices built-in.

It is inspired by a talk on spec-driven development by Sean Grove of OpenAI.

Core Features

  • Spec-Driven: Define your analysis using a simple configuration file or command-line flags.
  • User-Friendly: A simple init command gets you started in seconds.
  • Automation-Ready: Fully scriptable using command-line flags for CI/CD pipelines.
  • Multi-Language Support: Generate code for Python (pandas/PySpark) or SQL (dbt-style).
  • Intelligent Schema Handling: Provide sample data directly or connect to a live database to have the tool automatically infer the schema.
  • Enterprise-Grade: Built with Rust for performance, reliability, and security.

Prerequisites

  • A recent Rust toolchain (cargo is used to build and install the CLI)
  • An API key for OpenAI or Gemini (Google) for code generation

Quick Start

1. Clone the Repository

git clone git@github.com:renbytes/specds.git
cd specds

2. Configure Environment Variables

Copy the example .env file and add an API key for your chosen provider (OpenAI or Gemini).

cp .env.example .env
# Edit .env and add your key:
# OPENAI_API_KEY="sk-..."
# GEMINI_API_KEY="AI..."
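
If you prefer not to keep a .env file, the same variables can be exported directly in your shell. This is a sketch and assumes specds also reads the keys from the process environment:

export OPENAI_API_KEY="sk-..."
# or, for Gemini:
export GEMINI_API_KEY="AI..."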

3. Build the Project

Compile the project in release mode for optimal performance.

cargo build --release

4. Install Globally

Make the command globally available:

sudo cp ./target/release/specds /usr/local/bin/
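
Alternatively, cargo can build and install the binary in one step. This sketch assumes ~/.cargo/bin is on your PATH:

cargo install --path .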

Usage

There are two primary ways to use this tool: the simple File-Based Workflow (recommended for getting started) and the powerful Flag-Based Workflow (ideal for automation).

Workflow 1: File-Based (Recommended)

This is the easiest way to get started.

Step 1: Initialize a spec file

specds init

This creates a spec.toml file with helpful comments and examples.

Step 2: Edit spec.toml

Open the newly created spec.toml file and fill in your analysis details:

# spec.toml
language = "Python"
analysis_type = "Simple Aggregation"
description = "A weekly report on new user signups."

[[dataset]]
name = "user_events"
description = "Primary input dataset."
sample_data_path = "path/to/your/sample_data.csv"

[[metric]]
name = "new_signups"
logic = "Users where event_type is 'signup' and is_new_user is true"
aggregation = "CountDistinct"
aggregation_field = "user_id"
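
Because [[metric]] is a TOML array of tables, a spec can describe several metrics by repeating the block. The extra metric below is an illustrative sketch (its name and logic are assumptions, not output of specds init):

[[metric]]
name = "unique_active_users"
logic = "Users with at least one event in the reporting week"
aggregation = "CountDistinct"
aggregation_field = "user_id"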

Step 3: Generate the code

specds generate --spec spec.toml --provider gemini --model gemini-2.5-pro

Note: you need to pick an LLM provider and model. Currently OpenAI and Gemini (Google) are supported.
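
For OpenAI, the invocation has the same shape with a different provider and model. A sketch (the provider string "openai" and the model name are assumptions; substitute whichever model your account offers):

specds generate --spec spec.toml --provider openai --model gpt-4o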

Workflow 2: Flag-Based (Automation)

This method is ideal for scripting and automation:

specds generate \
  --language python \
  --description "A weekly report on new user signups." \
  --analysis-type "Simple Aggregation" \
  --dataset-name "user_events" \
  --sample-data-path ./sample_data.csv \
  --metric-name "new_signups" \
  --metric-logic "Users where event_type is 'signup' and is_new_user is true" \
  --aggregation count-distinct \
  --aggregation-field "user_id"
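
Because every option is a flag, the same command drops neatly into a script or CI job. A minimal sketch of a wrapper (the SAMPLE_DATA variable and secret handling are placeholders, not part of specds):

#!/usr/bin/env bash
set -euo pipefail

# The API key is expected to come from the CI secret store.
specds generate \
  --language python \
  --description "A weekly report on new user signups." \
  --analysis-type "Simple Aggregation" \
  --dataset-name "user_events" \
  --sample-data-path "$SAMPLE_DATA" \
  --metric-name "new_signups" \
  --metric-logic "Users where event_type is 'signup' and is_new_user is true" \
  --aggregation count-distinct \
  --aggregation-field "user_id" \
  --provider gemini \
  --model gemini-2.5-pro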

Output

Either workflow creates a new directory inside generated_jobs/ with a timestamp, containing your complete, tested analysis pipeline:

generated_jobs/
└── python/
    └── simple-aggregation/
        └── 20250720-193000__a-weekly-report-on-new-user-signups/
            ├── job.py
            ├── functions.py
            ├── tests/
            │   ├── test_job.py
            │   └── test_functions.py
            └── README.md
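
The generated tests can be run immediately with pytest. A sketch, assuming a Python environment with pandas and pytest installed (adjust the timestamped directory to match your run):

cd generated_jobs/python/simple-aggregation/20250720-193000__a-weekly-report-on-new-user-signups
python -m pytest tests/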

Examples

Explore real-world use cases in the examples/ directory:

  • E-commerce - Top selling products analysis (SQL)
  • Healthcare - Patient length of stay analysis (SQL)
  • Finance - Stock volatility calculation (Python)
  • Energy - Renewable energy production analysis (Python)
  • Consumer Tech - Ad attribution pipeline (PySpark)

Supported Languages

Language | Framework | Use Case
Python   | pandas    | Data analysis, reporting
PySpark  | Spark     | Big data, distributed computing
SQL      | dbt-style | Data warehousing, analytics

Development

Running Checks

To ensure code quality, run the following commands:

  • Format: cargo fmt
  • Lint: cargo clippy -- -D warnings
  • Test: cargo test
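
These can be chained into a single command before opening a pull request:

cargo fmt && cargo clippy -- -D warnings && cargo test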

Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

License

This project is licensed under the MIT OR Apache-2.0 license.

Support

  • 📖 Documentation: Check the examples/ directory for detailed use cases
  • 🐛 Issues: Report bugs on GitHub Issues
  • 💬 Discussions: Join conversations on GitHub Discussions