shimmy

Crates.io: shimmy
lib.rs: shimmy
version: 1.9.0
created_at: 2025-09-04 19:16:09.102735+00
updated_at: 2026-01-10 15:56:36.134969+00
description: Lightweight sub-5MB Ollama alternative with native SafeTensors support. No Python dependencies, 2x faster loading. Now with GitHub Spec-Kit integration for systematic development.
homepage: https://github.com/Michael-A-Kuykendall/shimmy
repository: https://github.com/Michael-A-Kuykendall/shimmy
id: 1824656
size: 867,110
owner: Mike Kuykendall (Michael-A-Kuykendall)

README

Shimmy Logo

The Lightweight OpenAI API Server

🔒 Local Inference Without Dependencies 🚀


๐Ÿ’ Sponsor this project

Shimmy will be free forever. No asterisks. No "free for now." No pivot to paid.

๐Ÿ’ Support Shimmy's Growth

๐Ÿš€ If Shimmy helps you, consider sponsoring โ€” 100% of support goes to keeping it free forever.

  • $5/month: Coffee tier โ˜• - Eternal gratitude + sponsor badge
  • $25/month: Bug prioritizer ๐Ÿ› - Priority support + name in SPONSORS.md
  • $100/month: Corporate backer ๐Ÿข - Logo placement + monthly office hours
  • $500/month: Infrastructure partner ๐Ÿš€ - Direct support + roadmap input

๐ŸŽฏ Become a Sponsor | See our amazing sponsors ๐Ÿ™


Drop-in OpenAI API Replacement for Local LLMs

Shimmy is a single binary that provides 100% OpenAI-compatible endpoints for GGUF models. Point your existing AI tools to Shimmy and they just work - locally, privately, and free.

🎉 NEW in v1.9.0: One download, all GPU backends included! No compilation, no backend confusion - just download and run.

Developer Tools

Whether you're forking Shimmy or integrating it as a service, we provide complete documentation and integration templates.

Try it in 30 seconds

# 1) Download pre-built binary (includes all GPU backends)
# Windows:
curl -L https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-windows-x86_64.exe -o shimmy.exe
./shimmy.exe serve &

# Linux:
curl -L https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-linux-x86_64 -o shimmy && chmod +x shimmy
./shimmy serve &

# macOS (Apple Silicon):
curl -L https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-macos-arm64 -o shimmy && chmod +x shimmy
./shimmy serve &

# 2) See models and pick one
./shimmy list

# 3) Smoke test the OpenAI API
curl -s http://127.0.0.1:11435/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
        "model":"REPLACE_WITH_MODEL_FROM_list",
        "messages":[{"role":"user","content":"Say hi in 5 words."}],
        "max_tokens":32
      }' | jq -r '.choices[0].message.content'

🚀 Compatible with OpenAI SDKs and Tools

No code changes needed - just change the API endpoint:

  • Any OpenAI client: Python, Node.js, curl, etc.
  • Development applications: Compatible with standard SDKs
  • VSCode Extensions: Point to http://localhost:11435
  • Cursor Editor: Built-in OpenAI compatibility
  • Continue.dev: Drop-in model provider

Use with OpenAI SDKs

  • Node.js (openai v4)
import OpenAI from "openai";

const openai = new OpenAI({
  baseURL: "http://127.0.0.1:11435/v1",
  apiKey: "sk-local", // placeholder, Shimmy ignores it
});

const resp = await openai.chat.completions.create({
  model: "REPLACE_WITH_MODEL",
  messages: [{ role: "user", content: "Say hi in 5 words." }],
  max_tokens: 32,
});

console.log(resp.choices[0].message?.content);
  • Python (openai>=1.0.0)
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:11435/v1", api_key="sk-local")

resp = client.chat.completions.create(
    model="REPLACE_WITH_MODEL",
    messages=[{"role": "user", "content": "Say hi in 5 words."}],
    max_tokens=32,
)

print(resp.choices[0].message.content)
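
If your tooling streams tokens, the standard OpenAI stream flag should work against the same endpoint. The curl below is a minimal sketch that assumes Shimmy honors "stream": true on /v1/chat/completions; verify against your build.

# Streaming sketch (assumes the standard OpenAI "stream" flag is honored)
curl -N -s http://127.0.0.1:11435/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
        "model":"REPLACE_WITH_MODEL",
        "messages":[{"role":"user","content":"Count to five."}],
        "stream": true,
        "max_tokens": 32
      }'
# Expect OpenAI-style "data: {...}" chunks, terminated by "data: [DONE]"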

⚡ Zero Configuration Required

  • Automatically finds models from Hugging Face cache, Ollama, local dirs
  • Auto-allocates ports to avoid conflicts
  • Auto-detects LoRA adapters for specialized models
  • Just works - no config files, no setup wizards

🧠 Advanced MOE (Mixture of Experts) Support

Run 70B+ models on consumer hardware with intelligent CPU/GPU hybrid processing:

  • 🔄 CPU MOE Offloading: Automatically distribute model layers across CPU and GPU
  • 🧮 Intelligent Layer Placement: Optimizes which layers run where for maximum performance
  • 💾 Memory Efficiency: Fit larger models in limited VRAM by using system RAM strategically
  • ⚡ Hybrid Acceleration: Get GPU speed where it matters most, CPU reliability everywhere else
  • 🎛️ Configurable: --cpu-moe and --n-cpu-moe flags for fine control

# Enable MOE CPU offloading during installation
cargo install shimmy --features moe

# Run with MOE hybrid processing
shimmy serve --cpu-moe --n-cpu-moe 8

# Automatically balances: GPU layers (fast) + CPU layers (memory-efficient)

Perfect for: Large models (70B+), limited VRAM systems, cost-effective inference

🎯 Perfect for Local Development

  • Privacy: Your code never leaves your machine
  • Cost: No API keys, no per-token billing
  • Speed: Local inference, sub-second responses
  • Reliability: No rate limits, no downtime

Quick Start (30 seconds)

Installation

✨ v1.9.0 NEW: Download pre-built binaries with ALL GPU backends included!

📥 Pre-Built Binaries (Recommended - Zero Dependencies)

Pick your platform and download - no compilation needed:

# Windows x64 (includes CUDA + Vulkan + OpenCL)
curl -L https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-windows-x86_64.exe -o shimmy.exe

# Linux x86_64 (includes CUDA + Vulkan + OpenCL)
curl -L https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-linux-x86_64 -o shimmy && chmod +x shimmy

# macOS ARM64 (includes MLX for Apple Silicon)
curl -L https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-macos-arm64 -o shimmy && chmod +x shimmy

# macOS Intel (CPU-only)
curl -L https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-macos-intel -o shimmy && chmod +x shimmy

# Linux ARM64 (CPU-only)
curl -L https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-linux-aarch64 -o shimmy && chmod +x shimmy

That's it! Your GPU will be detected automatically at runtime.

🛠️ Build from Source (Advanced)

Want to customize or contribute?

# Basic installation (CPU only)
cargo install shimmy --features huggingface

# Kitchen Sink builds (what pre-built binaries use):
# Windows/Linux x64:
cargo install shimmy --features huggingface,llama,llama-cuda,llama-vulkan,llama-opencl,vision

# macOS ARM64:
cargo install shimmy --features huggingface,llama,mlx,vision

# CPU-only (any platform):
cargo install shimmy --features huggingface,llama,vision

⚠️ Build Notes:

  • Windows: Install LLVM first for libclang.dll (see the sketch after these notes)
  • Recommended: Use pre-built binaries to avoid dependency issues
  • Advanced users only: Building from source requires C++ compiler + CUDA/Vulkan SDKs
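
For the Windows note above, one possible setup is sketched below; it assumes winget is available and that LLVM lands in the default install path (adjust LIBCLANG_PATH if yours differs).

# Windows sketch: install LLVM, expose libclang, then build
winget install LLVM.LLVM
setx LIBCLANG_PATH "C:\Program Files\LLVM\bin"   # adjust if LLVM is installed elsewhere
# open a new terminal so LIBCLANG_PATH is picked up, then:
cargo install shimmy --features huggingface,llama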

GPU Acceleration

✨ NEW in v1.9.0: One binary per platform with automatic GPU detection!

⚠️ IMPORTANT - Vision Feature Performance:
CPU-based vision inference (MiniCPM-V) is 5-10x slower than GPU acceleration.
CPU: 15-45 seconds per image | GPU (CUDA/Vulkan): 2-8 seconds per image
For production vision workloads, GPU acceleration is strongly recommended.

📥 Download Pre-Built Binaries (Recommended)

No compilation needed! Each binary includes ALL GPU backends for your platform:

Platform     | Download                  | GPU Support            | Auto-Detects
Windows x64  | shimmy-windows-x86_64.exe | CUDA + Vulkan + OpenCL | ✅
Linux x86_64 | shimmy-linux-x86_64       | CUDA + Vulkan + OpenCL | ✅
macOS ARM64  | shimmy-macos-arm64        | MLX (Apple Silicon)    | ✅
macOS Intel  | shimmy-macos-intel        | CPU only               | N/A
Linux ARM64  | shimmy-linux-aarch64      | CPU only               | N/A

How it works: Download one file, run it. Shimmy automatically detects and uses your GPU!

# Windows example
curl -L https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-windows-x86_64.exe -o shimmy.exe
./shimmy.exe serve --gpu-backend auto  # Auto-detects CUDA/Vulkan/OpenCL

# Linux example  
curl -L https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-linux-x86_64 -o shimmy
chmod +x shimmy
./shimmy serve --gpu-backend auto  # Auto-detects CUDA/Vulkan/OpenCL

# macOS ARM64 example
curl -L https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-macos-arm64 -o shimmy
chmod +x shimmy  
./shimmy serve  # Auto-detects MLX on Apple Silicon

🎯 GPU Auto-Detection

Shimmy uses intelligent GPU detection with this priority order:

  1. CUDA (NVIDIA GPUs via nvidia-smi)
  2. Vulkan (Cross-platform GPUs via vulkaninfo)
  3. OpenCL (AMD/Intel GPUs via clinfo)
  4. MLX (Apple Silicon via system detection)
  5. CPU (Fallback if no GPU detected)

No manual configuration needed! Just run with --gpu-backend auto (default).
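
To preview what those probes would report on your machine, you can run the same tools yourself. This is an illustrative Linux shell sketch; it assumes the utilities are installed, and Shimmy's internal detection may differ in detail.

# Check which GPU probe tools are present and what they see
command -v nvidia-smi >/dev/null && nvidia-smi --query-gpu=name --format=csv,noheader
command -v vulkaninfo >/dev/null && vulkaninfo --summary | head -n 20
command -v clinfo     >/dev/null && clinfo -l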

🔧 Manual Backend Override

Want to force a specific backend? Use the --gpu-backend flag:

# Auto-detect (default - recommended)
shimmy serve --gpu-backend auto

# Force CPU (for testing or compatibility)
shimmy serve --gpu-backend cpu

# Force CUDA (NVIDIA GPUs only)
shimmy serve --gpu-backend cuda

# Force Vulkan (AMD/Intel/Cross-platform)
shimmy serve --gpu-backend vulkan

# Force OpenCL (AMD/Intel alternative)
shimmy serve --gpu-backend opencl

🛡️ Error Handling & Robustness: If you force an unavailable backend (e.g., --gpu-backend cuda on an AMD GPU), Shimmy will:

  1. ✅ Display a clear error message explaining the issue
  2. ✅ Automatically fall back to the next available backend in priority order
  3. ✅ Log which backend was actually used (check with --verbose)
  4. ✅ Continue serving requests (graceful degradation, no crashes)
  5. ✅ Support environment variable override: SHIMMY_GPU_BACKEND=cuda

Common scenarios:

  • --gpu-backend cuda on non-NVIDIA → Falls back to Vulkan or OpenCL
  • --gpu-backend vulkan without drivers → Falls back to OpenCL or CPU
  • --gpu-backend invalid → Clear error + fallback to auto-detection
  • No GPU detected → Runs on CPU with performance warning

Environment Variable: Set SHIMMY_GPU_BACKEND=cuda to override the default backend without CLI flags.
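
As a quick sketch, the variable can be set for a single run or exported for the session; backend names mirror the --gpu-backend values above.

# One-off run with a forced backend
SHIMMY_GPU_BACKEND=vulkan ./shimmy serve

# Or export it for the whole session
export SHIMMY_GPU_BACKEND=cuda
./shimmy serve --verbose   # the log shows which backend was actually selected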

๐Ÿ” Check GPU Support

# Show detected GPU backends
shimmy gpu-info

# Check which backend is being used
shimmy serve --gpu-backend auto --verbose

⚡ Binary Sizes

  • GPU-enabled binaries (Windows/Linux x64, macOS ARM64): ~40-50MB
  • CPU-only binaries (macOS Intel, Linux ARM64): ~20-30MB

Trade-off: Slightly larger binaries for zero compilation and automatic GPU detection.

🛠️ Build from Source (Advanced)

Want to customize or contribute? Build from source:

  • Multiple backends can be compiled in, best one selected automatically
  • Use --gpu-backend <backend> to force specific backend

Get Models

Shimmy auto-discovers models from:

  • Hugging Face cache: ~/.cache/huggingface/hub/
  • Ollama models: ~/.ollama/models/
  • Local directory: ./models/
  • Environment: SHIMMY_BASE_GGUF=path/to/model.gguf (see the example below)
# Download models that work out of the box
huggingface-cli download microsoft/Phi-3-mini-4k-instruct-gguf --local-dir ./models/
huggingface-cli download bartowski/Llama-3.2-1B-Instruct-GGUF --local-dir ./models/
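
If a model lives somewhere Shimmy does not scan, the SHIMMY_BASE_GGUF variable listed above can point at it directly. A minimal sketch (the path is just an example):

# Point Shimmy at a specific GGUF file instead of relying on discovery
export SHIMMY_BASE_GGUF=./models/Phi-3-mini-4k-instruct-q4.gguf   # example path
shimmy list    # the model should now show up
shimmy serve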

Start Server

# Auto-allocates port to avoid conflicts
shimmy serve

# Or use manual port
shimmy serve --bind 127.0.0.1:11435

Point your development tools to the displayed port - VSCode Copilot, Cursor, Continue.dev all work instantly.

📦 Download & Install

Direct Downloads

  • GitHub Releases: Latest binaries
  • Docker: docker pull shimmy/shimmy:latest (coming soon)

๐ŸŽ macOS Support

Full compatibility confirmed! Shimmy works flawlessly on macOS with Metal GPU acceleration.

# Install dependencies
brew install cmake rust

# Install shimmy
cargo install shimmy

✅ Verified working:

  • Intel and Apple Silicon Macs
  • Metal GPU acceleration (automatic)
  • MLX native acceleration for Apple Silicon
  • Xcode 17+ compatibility
  • All LoRA adapter features

Integration Examples

VSCode Copilot

{
  "github.copilot.advanced": {
    "serverUrl": "http://localhost:11435"
  }
}

Continue.dev

{
  "models": [{
    "title": "Local Shimmy",
    "provider": "openai",
    "model": "your-model-name",
    "apiBase": "http://localhost:11435/v1"
  }]
}

Cursor IDE

Works out of the box - just point to http://localhost:11435/v1

Why Shimmy Will Always Be Free

I built Shimmy to keep privacy-first control over my own AI development and to keep things local and lean.

This is my commitment: Shimmy stays MIT licensed, forever. If you want to support development, sponsor it. If you don't, just build something cool with it.

💡 Shimmy saves you time and money. If it's useful, consider sponsoring for $5/month - less than your Netflix subscription, infinitely more useful for developers.

API Reference

Endpoints

  • GET /health - Health check
  • POST /v1/chat/completions - OpenAI-compatible chat
  • GET /v1/models - List available models
  • POST /api/generate - Shimmy native API
  • GET /ws/generate - WebSocket streaming
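
A few quick curl probes against the endpoints above (sketches only; the /v1/models response is assumed to use the standard OpenAI shape, and the native /api/generate payload is not shown because its schema isn't documented here):

# Health check
curl -s http://127.0.0.1:11435/health

# List available models (OpenAI-style response assumed)
curl -s http://127.0.0.1:11435/v1/models | jq -r '.data[].id'

# OpenAI-compatible chat, same shape as the smoke test earlier
curl -s http://127.0.0.1:11435/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"REPLACE_WITH_MODEL","messages":[{"role":"user","content":"ping"}],"max_tokens":8}'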

CLI Commands

shimmy serve                    # Start server (auto port allocation)
shimmy serve --bind 127.0.0.1:8080  # Manual port binding
shimmy serve --cpu-moe --n-cpu-moe 8  # Enable MOE CPU offloading
shimmy list                     # Show available models (LLM-filtered)
shimmy discover                 # Refresh model discovery
shimmy generate --name X --prompt "Hi"  # Test generation
shimmy probe model-name         # Verify model loads
shimmy gpu-info                 # Show GPU backend status

Technical Architecture

  • Rust + Tokio: Memory-safe, async performance
  • llama.cpp backend: Industry-standard GGUF inference
  • OpenAI API compatibility: Drop-in replacement
  • Dynamic port management: Zero conflicts, auto-allocation
  • Zero-config auto-discovery: Just works™

🚀 Advanced Features

  • 🧠 MOE CPU Offloading: Hybrid GPU/CPU processing for large models (70B+)
  • 🎯 Smart Model Filtering: Automatically excludes non-language models (Stable Diffusion, Whisper, CLIP)
  • 🛡️ 6-Gate Release Validation: Constitutional quality limits ensure reliability
  • ⚡ Smart Model Preloading: Background loading with usage tracking for instant model switching
  • 💾 Response Caching: LRU + TTL cache delivering 20-40% performance gains on repeat queries
  • 🚀 Integration Templates: One-command deployment for Docker, Kubernetes, Railway, Fly.io, FastAPI, Express
  • 🔄 Request Routing: Multi-instance support with health checking and load balancing
  • 📊 Advanced Observability: Real-time metrics with self-optimization and Prometheus integration
  • 🔗 RustChain Integration: Universal workflow transpilation with workflow orchestration

Community & Support

Star History

Star History Chart

🚀 Momentum Snapshot

📦 Sub-5MB single binary (142x smaller than Ollama) • 🌟 GitHub stars climbing fast • ⏱ <1s startup • 🦀 100% Rust, no Python

📰 As Featured On

🔥 Hacker News • Front Page Again • IPE Newsletter

Companies: Need invoicing? Email michaelallenkuykendall@gmail.com

⚡ Performance Comparison

Tool      | Binary Size | Startup Time | Memory Usage | OpenAI API
Shimmy    | 4.8MB       | <100ms       | 50MB         | 100%
Ollama    | 680MB       | 5-10s        | 200MB+       | Partial
llama.cpp | 89MB        | 1-2s         | 100MB        | Via llama-server

Quality & Reliability

Shimmy maintains high code quality through comprehensive testing:

  • Comprehensive test suite with property-based testing
  • Automated CI/CD pipeline with quality gates
  • Runtime invariant checking for critical operations
  • Cross-platform compatibility testing

Development Testing

Run the complete test suite:

# Using cargo aliases
cargo test-quick           # Quick development tests

# Using Makefile  
make test                  # Full test suite
make test-quick            # Quick development tests

See our testing approach for technical details.


License & Philosophy

MIT License - forever and always.

Philosophy: Infrastructure should be invisible. Shimmy is infrastructure.

Testing Philosophy: Reliability through comprehensive validation and property-based testing.


Forever maintainer: Michael A. Kuykendall
Promise: This will never become a paid product
Mission: Making local model inference simple and reliable
