halldyll_deploy_pods 0.1.0

Declarative, idempotent, and reconcilable deployment system for RunPod GPU pods

Repository:    https://github.com/Mr-soloDev/halldyll_deploy_pods
Documentation: https://docs.rs/halldyll_deploy_pods
Owner:         Mr-soloDev

README

Halldyll Deploy Pods


A declarative, idempotent, and reconcilable deployment system for RunPod GPU pods.

Think of it as Terraform/Kubernetes for RunPod — define your GPU infrastructure as code, and let Halldyll handle the rest.

Features

  • Declarative — Define your infrastructure in a simple YAML file
  • Idempotent — Run apply multiple times, get the same result
  • Drift Detection — Automatically detect and fix configuration drift
  • Reconciliation Loop — Continuously converge to desired state
  • State Management — Track deployments locally or on S3
  • Multi-environment — Support for dev, staging, prod environments
  • Guardrails — Cost limits, GPU limits, TTL auto-stop
  • Auto Model Download — Automatically download HuggingFace models on pod startup
  • Inference Engines — Auto-start vLLM, TGI, or Ollama with your models

Installation

From Crates.io

cargo install halldyll_deploy_pods

From Source

git clone https://github.com/Mr-soloDev/halldyll_deploy_pods.git
cd halldyll_deploy_pods
cargo install --path .

Quick Start

1. Initialize a new project

halldyll init my-project
cd my-project

2. Configure your deployment

Edit halldyll.deploy.yaml:

project:
  name: "my-ml-stack"
  environment: "prod"
  cloud_type: SECURE

state:
  backend: local

pods:
  - name: "inference"
    gpu:
      type: "NVIDIA A40"
      count: 1
    runtime:
      image: "vllm/vllm-openai:latest"
      env:
        MODEL_NAME: "meta-llama/Llama-3-8B"
    ports:
      - "8000/http"
    volumes:
      - name: "hf-cache"
        mount: "/root/.cache/huggingface"
        persistent: true

3. Set your RunPod API key

export RUNPOD_API_KEY="your-api-key"

4. Deploy!

halldyll plan      # Preview changes
halldyll apply     # Deploy to RunPod
halldyll status    # Check deployment status

Commands

Command                 Description
halldyll init [path]    Initialize a new project
halldyll validate       Validate configuration file
halldyll plan           Show deployment plan (dry-run)
halldyll apply          Apply the deployment plan
halldyll status         Show current deployment status
halldyll reconcile      Auto-fix drift from desired state
halldyll drift          Detect configuration drift
halldyll destroy        Destroy all deployed resources
halldyll logs <pod>     View pod logs
halldyll state          Manage deployment state

Configuration Reference

Project Configuration

project:
  name: "my-project"          # Required: unique project name
  environment: "dev"          # Optional: dev, staging, prod (default: dev)
  region: "EU"                # Optional: EU, US, etc.
  cloud_type: SECURE          # Optional: SECURE or COMMUNITY
  compute_type: GPU           # Optional: GPU or CPU

State Backend

state:
  backend: local              # local or s3
  # For S3:
  bucket: "my-state-bucket"
  prefix: "halldyll/my-project"
  region: "us-east-1"

Pod Configuration

pods:
  - name: "my-pod"
    gpu:
      type: "NVIDIA A40"      # GPU type
      count: 1                # Number of GPUs
      min_vram_gb: 40         # Optional: minimum VRAM
      fallback:               # Optional: fallback GPU types
        - "NVIDIA L40S"
        - "NVIDIA RTX A6000"
    
    ports:
      - "22/tcp"              # SSH
      - "8000/http"           # HTTP endpoint
    
    volumes:
      - name: "data"
        mount: "/data"
        persistent: true
        size_gb: 100
    
    runtime:
      image: "runpod/pytorch:2.1.0-py3.10-cuda11.8.0"
      env:
        MY_VAR: "value"
    
    health_check:
      endpoint: "/health"
      port: 8000
      interval_secs: 30
      timeout_secs: 5

Model Configuration (Auto-download and Start)

pods:
  - name: "llm-server"
    gpu:
      type: "NVIDIA A40"
      count: 1
    runtime:
      image: "vllm/vllm-openai:latest"
    ports:
      - "8000/http"
    
    # Models are automatically downloaded and engines started
    models:
      - id: "llama-3-8b"
        provider: huggingface           # huggingface, bundle, or custom
        repo: "meta-llama/Meta-Llama-3-8B-Instruct"
        load:
          engine: vllm                  # vllm, tgi, ollama, or transformers
          quant: awq                    # Optional: awq, gptq, fp8
          max_seq_len: 8192             # Optional: max sequence length
          options:                      # Optional: engine-specific options
            tensor-parallel-size: 1

Supported Inference Engines

Engine        Description                             Auto-Start   Use Case
vllm          High-performance LLM serving            Yes          Production LLM APIs, OpenAI-compatible
tgi           HuggingFace Text Generation Inference   Yes          HuggingFace models, streaming
ollama        Easy-to-use LLM runner                  Yes          Local development, quick testing
transformers  HuggingFace Transformers library        No           Custom scripts, fine-tuning
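
None of the examples below use the ollama engine, so here is a minimal sketch of an Ollama-backed pod. It assumes the same models schema as the other examples; the ollama/ollama:latest image and port 11434 (Ollama's default API port) are assumptions about a typical setup, not values taken from this crate's documentation:

pods:
  - name: "ollama-dev"
    gpu:
      type: "NVIDIA RTX 4090"
      count: 1
    runtime:
      image: "ollama/ollama:latest"   # assumed image for this sketch
    ports:
      - "11434/http"                  # Ollama's default API port (assumption)
    models:
      - id: "llama-3-8b"
        provider: huggingface
        repo: "meta-llama/Meta-Llama-3-8B-Instruct"
        load:
          engine: ollama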

Multi-Model Deployment Example

Deploy different models on different pods:

pods:
  # LLM API Server
  - name: "llm-api"
    gpu:
      type: "NVIDIA A40"
      count: 1
    runtime:
      image: "vllm/vllm-openai:latest"
    ports:
      - "8000/http"
    models:
      - id: "llama-3-8b"
        provider: huggingface
        repo: "meta-llama/Meta-Llama-3-8B-Instruct"
        load:
          engine: vllm
          max_seq_len: 8192

  # Embedding Server
  - name: "embeddings"
    gpu:
      type: "NVIDIA RTX 4090"
      count: 1
    runtime:
      image: "ghcr.io/huggingface/text-embeddings-inference:latest"
    ports:
      - "8080/http"
    models:
      - id: "bge-large"
        provider: huggingface
        repo: "BAAI/bge-large-en-v1.5"
        load:
          engine: tgi

  # Vision Model
  - name: "vision-api"
    gpu:
      type: "NVIDIA A40"
      count: 1
    runtime:
      image: "ghcr.io/huggingface/text-generation-inference:latest"
    ports:
      - "8000/http"
    models:
      - id: "llava"
        provider: huggingface
        repo: "llava-hf/llava-v1.6-mistral-7b-hf"
        load:
          engine: tgi

Quantization Options

Reduce memory usage with quantization:

models:
  - id: "llama-70b-awq"
    provider: huggingface
    repo: "TheBloke/Llama-2-70B-Chat-AWQ"
    load:
      engine: vllm
      quant: awq              # 4-bit AWQ quantization
      max_seq_len: 4096

Quant Method  Memory Reduction  Quality    Speed
awq           ~75%              High       Fast
gptq          ~75%              High       Medium
fp8           ~50%              Very High  Fast

Guardrails (Optional)

guardrails:
  max_hourly_cost: 10.0       # Maximum hourly cost in USD
  max_gpus: 4                 # Maximum total GPUs
  ttl_hours: 24               # Auto-stop after N hours
  allow_gpu_fallback: false   # Allow fallback to other GPU types
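
The guardrails block is shown on its own here; in a full halldyll.deploy.yaml it is assumed to sit at the top level alongside project, state, and pods, roughly like this sketch:

project:
  name: "my-ml-stack"
  environment: "prod"

state:
  backend: local

guardrails:
  max_hourly_cost: 10.0
  max_gpus: 4
  ttl_hours: 24

pods:
  - name: "inference"
    gpu:
      type: "NVIDIA A40"
      count: 1
    runtime:
      image: "vllm/vllm-openai:latest"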

Architecture

┌─────────────────────────────────────────────────────────┐
│              halldyll.deploy.yaml                       │
│                 (Desired State)                         │
└───────────────────────┬─────────────────────────────────┘
                        │
                        ▼
┌─────────────────────────────────────────────────────────┐
│            ConfigParser + Validator                     │
└───────────────────────┬─────────────────────────────────┘
                        │
         ┌──────────────┴──────────────┐
         ▼                             ▼
┌─────────────────┐          ┌─────────────────┐
│   StateStore    │          │  PodObserver    │
│ (Local or S3)   │          │ (RunPod API)    │
└────────┬────────┘          └────────┬────────┘
         │                            │
         └──────────────┬─────────────┘
                        ▼
┌─────────────────────────────────────────────────────────┐
│                    DiffEngine                           │
│           (Compare Desired vs Observed)                 │
└───────────────────────┬─────────────────────────────────┘
                        ▼
┌─────────────────────────────────────────────────────────┐
│                  Reconciler                             │
│        (Execute Plan → Converge State)                  │
└─────────────────────────────────────────────────────────┘

Library Usage

You can also use Halldyll as a library in your Rust projects:

use halldyll_deploy_pods::{
    ConfigParser, ConfigValidator, DeployConfig,
    RunPodClient, PodProvisioner, PodObserver, PodExecutor,
    Reconciler, StateStore, LocalStateStore,
};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Parse configuration
    let config = ConfigParser::parse_file("halldyll.deploy.yaml")?;
    
    // Validate
    ConfigValidator::validate(&config)?;
    
    // Create RunPod client
    let client = RunPodClient::new(&std::env::var("RUNPOD_API_KEY")?)?;
    
    // Create provisioner and deploy with auto model setup
    let provisioner = PodProvisioner::new(client.clone());
    let (pod, setup_result) = provisioner.create_pod_with_setup(
        &config.pods[0],
        &config.project,
        "config-hash"
    ).await?;
    
    // Check model setup results
    if let Some(result) = setup_result {
        println!("Setup: {}", result.summary());
    }
    
    Ok(())
}

Environment Variables

Variable                Description                                             Required
RUNPOD_API_KEY          Your RunPod API key                                     Yes
HF_TOKEN                HuggingFace API token (for gated models like Llama)     For gated models
HALLDYLL_CONFIG         Path to config file                                     No
AWS_ACCESS_KEY_ID       AWS credentials (for S3 state)                          No
AWS_SECRET_ACCESS_KEY   AWS credentials (for S3 state)                          No

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add some amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Author

Geryan Roy (@Mr-soloDev)

Acknowledgments

  • RunPod for the amazing GPU cloud platform
  • Inspired by Terraform, Kubernetes, and other declarative infrastructure tools