| Crates.io | halldyll_deploy_pods |
| lib.rs | halldyll_deploy_pods |
| version | 0.1.0 |
| created_at | 2026-01-20 13:27:41.366242+00 |
| updated_at | 2026-01-20 13:27:41.366242+00 |
| description | Declarative, idempotent, and reconcilable deployment system for RunPod GPU pods |
| homepage | https://github.com/Mr-soloDev/halldyll_deploy_pods |
| repository | https://github.com/Mr-soloDev/halldyll_deploy_pods |
| max_upload_size | |
| id | 2056548 |
| size | 385,201 |
A declarative, idempotent, and reconcilable deployment system for RunPod GPU pods.
Think of it as Terraform/Kubernetes for RunPod — define your GPU infrastructure as code, and let Halldyll handle the rest.
Idempotent: apply multiple times, get the same result.

Install from crates.io:

```bash
cargo install halldyll_deploy_pods
```

Or build from source:

```bash
git clone https://github.com/Mr-soloDev/halldyll_deploy_pods.git
cd halldyll_deploy_pods
cargo install --path .
```
```bash
halldyll init my-project
cd my-project
```
Edit halldyll.deploy.yaml:
```yaml
project:
  name: "my-ml-stack"
  environment: "prod"
  cloud_type: SECURE

state:
  backend: local

pods:
  - name: "inference"
    gpu:
      type: "NVIDIA A40"
      count: 1
    runtime:
      image: "vllm/vllm-openai:latest"
      env:
        MODEL_NAME: "meta-llama/Llama-3-8B"
    ports:
      - "8000/http"
    volumes:
      - name: "hf-cache"
        mount: "/root/.cache/huggingface"
        persistent: true
```
```bash
export RUNPOD_API_KEY="your-api-key"

halldyll plan     # Preview changes
halldyll apply    # Deploy to RunPod
halldyll status   # Check deployment status
```
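Once `apply` completes, the quickstart pod serves vLLM's OpenAI-compatible API on port 8000. A minimal request sketch, where the placeholder URL stands for your pod's public HTTP endpoint and the model name assumes the `MODEL_NAME` set above:

```bash
curl -s https://<your-pod-http-endpoint>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Llama-3-8B",
        "messages": [{"role": "user", "content": "Say hello"}]
      }'
```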
| Command | Description |
|---|---|
| `halldyll init [path]` | Initialize a new project |
| `halldyll validate` | Validate configuration file |
| `halldyll plan` | Show deployment plan (dry-run) |
| `halldyll apply` | Apply the deployment plan |
| `halldyll status` | Show current deployment status |
| `halldyll reconcile` | Auto-fix drift from desired state |
| `halldyll drift` | Detect configuration drift |
| `halldyll destroy` | Destroy all deployed resources |
| `halldyll logs <pod>` | View pod logs |
| `halldyll state` | Manage deployment state |
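As an example of how these commands combine day to day, here is one possible drift-handling pass (a sketch that uses only the commands listed above; output is omitted):

```bash
# Detect drift between halldyll.deploy.yaml and what is actually running
halldyll drift

# Preview the actions needed to converge back to the desired state
halldyll plan

# Apply the fixes automatically
halldyll reconcile

# Confirm the deployment matches the config again
halldyll status
```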
```yaml
project:
  name: "my-project"         # Required: unique project name
  environment: "dev"         # Optional: dev, staging, prod (default: dev)
  region: "EU"               # Optional: EU, US, etc.
  cloud_type: SECURE         # Optional: SECURE or COMMUNITY
  compute_type: GPU          # Optional: GPU or CPU

state:
  backend: local             # local or s3
  # For S3:
  bucket: "my-state-bucket"
  prefix: "halldyll/my-project"
  region: "us-east-1"

pods:
  - name: "my-pod"
    gpu:
      type: "NVIDIA A40"     # GPU type
      count: 1               # Number of GPUs
      min_vram_gb: 40        # Optional: minimum VRAM
      fallback:              # Optional: fallback GPU types
        - "NVIDIA L40S"
        - "NVIDIA RTX A6000"
    ports:
      - "22/tcp"             # SSH
      - "8000/http"          # HTTP endpoint
    volumes:
      - name: "data"
        mount: "/data"
        persistent: true
        size_gb: 100
    runtime:
      image: "runpod/pytorch:2.1.0-py3.10-cuda11.8.0"
      env:
        MY_VAR: "value"
    health_check:
      endpoint: "/health"
      port: 8000
      interval_secs: 30
      timeout_secs: 5
```
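For example, a minimal sketch of the `state` section when deployment state lives in S3 instead of on local disk, using only the fields listed above (bucket name and prefix are placeholders); the AWS credentials come from the environment variables covered later in this README:

```yaml
state:
  backend: s3
  bucket: "my-state-bucket"        # placeholder: any bucket your credentials can write to
  prefix: "halldyll/my-ml-stack"   # placeholder: key prefix for this project's state
  region: "us-east-1"
```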
```yaml
pods:
  - name: "llm-server"
    gpu:
      type: "NVIDIA A40"
      count: 1
    runtime:
      image: "vllm/vllm-openai:latest"
    ports:
      - "8000/http"

    # Models are automatically downloaded and engines started
    models:
      - id: "llama-3-8b"
        provider: huggingface    # huggingface, bundle, or custom
        repo: "meta-llama/Meta-Llama-3-8B-Instruct"
        load:
          engine: vllm           # vllm, tgi, ollama, or transformers
          quant: awq             # Optional: awq, gptq, fp8
          max_seq_len: 8192      # Optional: max sequence length
          options:               # Optional: engine-specific options
            tensor-parallel-size: 1
```
| Engine | Description | Auto-Start | Use Case |
|---|---|---|---|
| `vllm` | High-performance LLM serving | Yes | Production LLM APIs, OpenAI-compatible |
| `tgi` | HuggingFace Text Generation Inference | Yes | HuggingFace models, streaming |
| `ollama` | Easy-to-use LLM runner | Yes | Local development, quick testing |
| `transformers` | HuggingFace Transformers library | No | Custom scripts, fine-tuning |
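If you only want the weights staged on the pod, for your own scripts or fine-tuning, the `transformers` row above is the engine that does not auto-start a server. A sketch using the same `models` schema as the reference (the id is illustrative, and the assumption is that the download still happens automatically while serving is left to you):

```yaml
models:
  - id: "llama-3-8b-weights"     # illustrative id
    provider: huggingface
    repo: "meta-llama/Meta-Llama-3-8B-Instruct"
    load:
      engine: transformers       # No auto-start: load the model yourself in your script
```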
Deploy different models on different pods:
```yaml
pods:
  # LLM API Server
  - name: "llm-api"
    gpu:
      type: "NVIDIA A40"
      count: 1
    runtime:
      image: "vllm/vllm-openai:latest"
    ports:
      - "8000/http"
    models:
      - id: "llama-3-8b"
        provider: huggingface
        repo: "meta-llama/Meta-Llama-3-8B-Instruct"
        load:
          engine: vllm
          max_seq_len: 8192

  # Embedding Server
  - name: "embeddings"
    gpu:
      type: "NVIDIA RTX 4090"
      count: 1
    runtime:
      image: "ghcr.io/huggingface/text-embeddings-inference:latest"
    ports:
      - "8080/http"
    models:
      - id: "bge-large"
        provider: huggingface
        repo: "BAAI/bge-large-en-v1.5"
        load:
          engine: tgi

  # Vision Model
  - name: "vision-api"
    gpu:
      type: "NVIDIA A40"
      count: 1
    runtime:
      image: "ghcr.io/huggingface/text-generation-inference:latest"
    ports:
      - "8000/http"
    models:
      - id: "llava"
        provider: huggingface
        repo: "llava-hf/llava-v1.6-mistral-7b-hf"
        load:
          engine: tgi
```
Reduce memory usage with quantization:
```yaml
models:
  - id: "llama-70b-awq"
    provider: huggingface
    repo: "TheBloke/Llama-2-70B-Chat-AWQ"
    load:
      engine: vllm
      quant: awq          # 4-bit AWQ quantization
      max_seq_len: 4096
```
| Quant Method | Memory Reduction | Quality | Speed |
|---|---|---|---|
| `awq` | ~75% | High | Fast |
| `gptq` | ~75% | High | Medium |
| `fp8` | ~50% | Very High | Fast |
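As a rough sizing check (an estimate, not a benchmark): a 70B-parameter model stored as 16-bit weights needs about 70 × 10⁹ × 2 bytes ≈ 140 GB, so the ~75% reduction from AWQ brings the weights down to roughly 35 GB, which fits on a single 48 GB NVIDIA A40 with headroom left for the KV cache.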
```yaml
guardrails:
  max_hourly_cost: 10.0        # Maximum hourly cost in USD
  max_gpus: 4                  # Maximum total GPUs
  ttl_hours: 24                # Auto-stop after N hours
  allow_gpu_fallback: false    # Allow fallback to other GPU types
```
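With the values above, a forgotten deployment is bounded at roughly 10.0 USD/hour × 24 hours = 240 USD of worst-case spend before the TTL stops the pods, and it can never grow past 4 GPUs.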
```
┌─────────────────────────────────────────────┐
│            halldyll.deploy.yaml             │
│               (Desired State)               │
└───────────────────────┬─────────────────────┘
                        │
                        ▼
┌─────────────────────────────────────────────┐
│          ConfigParser + Validator           │
└───────────────────────┬─────────────────────┘
                        │
         ┌──────────────┴──────────────┐
         ▼                             ▼
┌─────────────────┐           ┌─────────────────┐
│   StateStore    │           │   PodObserver   │
│  (Local or S3)  │           │  (RunPod API)   │
└────────┬────────┘           └────────┬────────┘
         │                             │
         └──────────────┬──────────────┘
                        ▼
┌─────────────────────────────────────────────┐
│                 DiffEngine                  │
│        (Compare Desired vs Observed)        │
└───────────────────────┬─────────────────────┘
                        ▼
┌─────────────────────────────────────────────┐
│                 Reconciler                  │
│       (Execute Plan → Converge State)       │
└─────────────────────────────────────────────┘
```
You can also use Halldyll as a library in your Rust projects:
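Add the crate to `Cargo.toml` first (a sketch: the version matches the crate header above, and Tokio is assumed as the async runtime since the example below uses `#[tokio::main]`):

```toml
[dependencies]
halldyll_deploy_pods = "0.1.0"
tokio = { version = "1", features = ["full"] }
```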
```rust
use halldyll_deploy_pods::{
    ConfigParser, ConfigValidator, DeployConfig,
    RunPodClient, PodProvisioner, PodObserver, PodExecutor,
    Reconciler, StateStore, LocalStateStore,
};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Parse configuration
    let config = ConfigParser::parse_file("halldyll.deploy.yaml")?;

    // Validate
    ConfigValidator::validate(&config)?;

    // Create RunPod client
    let client = RunPodClient::new(&std::env::var("RUNPOD_API_KEY")?)?;

    // Create provisioner and deploy with auto model setup
    let provisioner = PodProvisioner::new(client.clone());
    let (pod, setup_result) = provisioner.create_pod_with_setup(
        &config.pods[0],
        &config.project,
        "config-hash",
    ).await?;

    // Check model setup results
    if let Some(result) = setup_result {
        println!("Setup: {}", result.summary());
    }

    Ok(())
}
```
| Variable | Description | Required |
|---|---|---|
| `RUNPOD_API_KEY` | Your RunPod API key | Yes |
| `HF_TOKEN` | HuggingFace API token (for gated models like Llama) | For gated models |
| `HALLDYLL_CONFIG` | Path to config file | No |
| `AWS_ACCESS_KEY_ID` | AWS credentials (for S3 state) | No |
| `AWS_SECRET_ACCESS_KEY` | AWS credentials (for S3 state) | No |
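Putting these together, a deployment that pulls a gated HuggingFace model and keeps its state in S3 might be launched like this (a sketch; every value is a placeholder):

```bash
export RUNPOD_API_KEY="your-runpod-api-key"
export HF_TOKEN="hf_your_token"                 # only needed for gated models such as Llama
export AWS_ACCESS_KEY_ID="your-access-key"      # only needed when state.backend is s3
export AWS_SECRET_ACCESS_KEY="your-secret-key"

halldyll validate && halldyll plan && halldyll apply
```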
Contributions are welcome! Please feel free to submit a Pull Request.
To contribute: create a feature branch (`git checkout -b feature/amazing-feature`), commit your changes (`git commit -m 'Add some amazing feature'`), push the branch (`git push origin feature/amazing-feature`), and open a Pull Request.

This project is licensed under the MIT License - see the LICENSE file for details.
Geryan Roy (@Mr-soloDev)