| Crates.io | shimmy |
| lib.rs | shimmy |
| version | 1.9.0 |
| created_at | 2025-09-04 19:16:09.102735+00 |
| updated_at | 2026-01-10 15:56:36.134969+00 |
| description | Lightweight sub-5MB Ollama alternative with native SafeTensors support. No Python dependencies, 2x faster loading. Now with GitHub Spec-Kit integration for systematic development. |
| homepage | https://github.com/Michael-A-Kuykendall/shimmy |
| repository | https://github.com/Michael-A-Kuykendall/shimmy |
| max_upload_size | |
| id | 1824656 |
| size | 867,110 |
Shimmy will be free forever. No asterisks. No "free for now." No pivot to paid.
If Shimmy helps you, consider sponsoring: 100% of support goes to keeping it free forever.
Become a Sponsor | See our amazing sponsors
Shimmy is a single binary that provides 100% OpenAI-compatible endpoints for GGUF models. Point your existing AI tools at Shimmy and they just work: locally, privately, and free.
NEW in v1.9.0: One download, all GPU backends included! No compilation, no backend confusion: just download and run.
Whether you're forking Shimmy or integrating it as a service, we provide complete documentation and integration templates.
# 1) Download pre-built binary (includes all GPU backends)
# Windows:
curl -L https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-windows-x86_64.exe -o shimmy.exe
./shimmy.exe serve &
# Linux:
curl -L https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-linux-x86_64 -o shimmy && chmod +x shimmy
./shimmy serve &
# macOS (Apple Silicon):
curl -L https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-macos-arm64 -o shimmy && chmod +x shimmy
./shimmy serve &
# 2) See models and pick one
./shimmy list
# 3) Smoke test the OpenAI API
curl -s http://127.0.0.1:11435/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"model":"REPLACE_WITH_MODEL_FROM_list",
"messages":[{"role":"user","content":"Say hi in 5 words."}],
"max_tokens":32
}' | jq -r '.choices[0].message.content'
No code changes needed - just change the API endpoint to http://localhost:11435.

import OpenAI from "openai";
const openai = new OpenAI({
baseURL: "http://127.0.0.1:11435/v1",
apiKey: "sk-local", // placeholder, Shimmy ignores it
});
const resp = await openai.chat.completions.create({
model: "REPLACE_WITH_MODEL",
messages: [{ role: "user", content: "Say hi in 5 words." }],
max_tokens: 32,
});
console.log(resp.choices[0].message?.content);
from openai import OpenAI
client = OpenAI(base_url="http://127.0.0.1:11435/v1", api_key="sk-local")
resp = client.chat.completions.create(
model="REPLACE_WITH_MODEL",
messages=[{"role": "user", "content": "Say hi in 5 words."}],
max_tokens=32,
)
print(resp.choices[0].message.content)
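Streaming should work through the same endpoint; a minimal curl sketch, assuming Shimmy honors the standard OpenAI stream parameter:

# Stream tokens as they are generated (server-sent events; -N disables curl buffering)
curl -N -s http://127.0.0.1:11435/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
        "model": "REPLACE_WITH_MODEL",
        "messages": [{"role": "user", "content": "Say hi in 5 words."}],
        "stream": true,
        "max_tokens": 32
      }'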
Run 70B+ models on consumer hardware with intelligent CPU/GPU hybrid processing:
Use the --cpu-moe and --n-cpu-moe flags for fine control.

# Enable MOE CPU offloading during installation
cargo install shimmy --features moe
# Run with MOE hybrid processing
shimmy serve --cpu-moe --n-cpu-moe 8
# Automatically balances: GPU layers (fast) + CPU layers (memory-efficient)
Perfect for: Large models (70B+), limited VRAM systems, cost-effective inference
v1.9.0 NEW: Download pre-built binaries with ALL GPU backends included!
Pick your platform and download - no compilation needed:
# Windows x64 (includes CUDA + Vulkan + OpenCL)
curl -L https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-windows-x86_64.exe -o shimmy.exe
# Linux x86_64 (includes CUDA + Vulkan + OpenCL)
curl -L https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-linux-x86_64 -o shimmy && chmod +x shimmy
# macOS ARM64 (includes MLX for Apple Silicon)
curl -L https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-macos-arm64 -o shimmy && chmod +x shimmy
# macOS Intel (CPU-only)
curl -L https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-macos-intel -o shimmy && chmod +x shimmy
# Linux ARM64 (CPU-only)
curl -L https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-linux-aarch64 -o shimmy && chmod +x shimmy
That's it! Your GPU will be detected automatically at runtime.
Want to customize or contribute?
# Basic installation (CPU only)
cargo install shimmy --features huggingface
# Kitchen Sink builds (what pre-built binaries use):
# Windows/Linux x64:
cargo install shimmy --features huggingface,llama,llama-cuda,llama-vulkan,llama-opencl,vision
# macOS ARM64:
cargo install shimmy --features huggingface,llama,mlx,vision
# CPU-only (any platform):
cargo install shimmy --features huggingface,llama,vision
Build Notes:
- Windows: Install LLVM first for libclang.dll
- Recommended: Use pre-built binaries to avoid dependency issues
- Advanced users only: Building from source requires C++ compiler + CUDA/Vulkan SDKs
NEW in v1.9.0: One binary per platform with automatic GPU detection!
IMPORTANT - Vision Feature Performance:
CPU-based vision inference (MiniCPM-V) is 5-10x slower than GPU acceleration.
CPU: 15-45 seconds per image | GPU (CUDA/Vulkan): 2-8 seconds per image
For production vision workloads, GPU acceleration is strongly recommended.
No compilation needed! Each binary includes ALL GPU backends for your platform:
| Platform | Download | GPU Support | Auto-Detects |
|---|---|---|---|
| Windows x64 | shimmy-windows-x86_64.exe | CUDA + Vulkan + OpenCL | Yes |
| Linux x86_64 | shimmy-linux-x86_64 | CUDA + Vulkan + OpenCL | Yes |
| macOS ARM64 | shimmy-macos-arm64 | MLX (Apple Silicon) | Yes |
| macOS Intel | shimmy-macos-intel | CPU only | N/A |
| Linux ARM64 | shimmy-linux-aarch64 | CPU only | N/A |
How it works: Download one file, run it. Shimmy automatically detects and uses your GPU!
# Windows example
curl -L https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-windows-x86_64.exe -o shimmy.exe
./shimmy.exe serve --gpu-backend auto # Auto-detects CUDA/Vulkan/OpenCL
# Linux example
curl -L https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-linux-x86_64 -o shimmy
chmod +x shimmy
./shimmy serve --gpu-backend auto # Auto-detects CUDA/Vulkan/OpenCL
# macOS ARM64 example
curl -L https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-macos-arm64 -o shimmy
chmod +x shimmy
./shimmy serve # Auto-detects MLX on Apple Silicon
Shimmy uses intelligent GPU detection to pick the best available backend for your hardware.
No manual configuration needed! Just run with --gpu-backend auto (the default).
Want to force a specific backend? Use the --gpu-backend flag:
# Auto-detect (default - recommended)
shimmy serve --gpu-backend auto
# Force CPU (for testing or compatibility)
shimmy serve --gpu-backend cpu
# Force CUDA (NVIDIA GPUs only)
shimmy serve --gpu-backend cuda
# Force Vulkan (AMD/Intel/Cross-platform)
shimmy serve --gpu-backend vulkan
# Force OpenCL (AMD/Intel alternative)
shimmy serve --gpu-backend opencl
Error Handling & Robustness: If you force an unavailable backend (e.g., --gpu-backend cuda on an AMD GPU), Shimmy logs what happened (run with --verbose for details) and falls back to the next usable backend instead of failing.

Common scenarios:
- --gpu-backend cuda on non-NVIDIA hardware → falls back to Vulkan or OpenCL
- --gpu-backend vulkan without drivers → falls back to OpenCL or CPU
- --gpu-backend invalid → clear error + fallback to auto-detection

Environment variable: set SHIMMY_GPU_BACKEND=cuda to override the default backend without CLI flags.
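If you prefer the environment variable route, here is a small usage sketch (assuming SHIMMY_GPU_BACKEND accepts the same backend names as --gpu-backend):

# One-off override for a single run (assumed to accept the same names as --gpu-backend)
SHIMMY_GPU_BACKEND=vulkan ./shimmy serve --verbose

# Or export it for the whole shell session
export SHIMMY_GPU_BACKEND=cpu
./shimmy serve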
# Show detected GPU backends
shimmy gpu-info
# Check which backend is being used
shimmy serve --gpu-backend auto --verbose
Trade-off: Slightly larger binaries for zero compilation and automatic GPU detection.
Want to customize or contribute? Build from source (see the cargo install commands above) and use --gpu-backend <backend> to force a specific backend.

Shimmy auto-discovers models from:
- ~/.cache/huggingface/hub/
- ~/.ollama/models/
- ./models/
- SHIMMY_BASE_GGUF=path/to/model.gguf (explicit model path via environment variable)

# Download models that work out of the box
huggingface-cli download microsoft/Phi-3-mini-4k-instruct-gguf --local-dir ./models/
huggingface-cli download bartowski/Llama-3.2-1B-Instruct-GGUF --local-dir ./models/
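Alternatively, point Shimmy at a single GGUF file with the SHIMMY_BASE_GGUF variable listed above; a sketch with a hypothetical file path:

# Serve one specific GGUF file (replace the path with a model you actually downloaded)
SHIMMY_BASE_GGUF=./models/Phi-3-mini-4k-instruct-q4.gguf shimmy serve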
# Auto-allocates port to avoid conflicts
shimmy serve
# Or use manual port
shimmy serve --bind 127.0.0.1:11435
Point your development tools at the displayed port: VSCode Copilot, Cursor, and Continue.dev all work instantly.
- cargo install shimmy --features moe (recommended)
- cargo install shimmy
- shimmy-llama-cpp-2 packages for better compatibility
- npm install -g shimmy-js (planned)
- pip install shimmy (planned)
- docker pull shimmy/shimmy:latest (coming soon)

Full compatibility confirmed! Shimmy works flawlessly on macOS with Metal GPU acceleration.
# Install dependencies
brew install cmake rust
# Install shimmy
cargo install shimmy
Verified working:
{
"github.copilot.advanced": {
"serverUrl": "http://localhost:11435"
}
}
{
"models": [{
"title": "Local Shimmy",
"provider": "openai",
"model": "your-model-name",
"apiBase": "http://localhost:11435/v1"
}]
}
Works out of the box - just point to http://localhost:11435/v1
I built Shimmy to keep privacy-first control over my AI development and to keep things local and lean.
This is my commitment: Shimmy stays MIT licensed, forever. If you want to support development, sponsor it. If you don't, just build something cool with it.
Shimmy saves you time and money. If it's useful, consider sponsoring for $5/month: less than your Netflix subscription, and infinitely more useful for developers.
GET /health - Health check
POST /v1/chat/completions - OpenAI-compatible chat
GET /v1/models - List available models
POST /api/generate - Shimmy native API
GET /ws/generate - WebSocket streaming

shimmy serve # Start server (auto port allocation)
shimmy serve --bind 127.0.0.1:8080 # Manual port binding
shimmy serve --cpu-moe --n-cpu-moe 8 # Enable MOE CPU offloading
shimmy list # Show available models (LLM-filtered)
shimmy discover # Refresh model discovery
shimmy generate --name X --prompt "Hi" # Test generation
shimmy probe model-name # Verify model loads
shimmy gpu-info # Show GPU backend status
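To confirm the server and its endpoints are reachable, two quick checks using the routes listed above (adjust the port if you used --bind):

# Liveness check
curl -s http://127.0.0.1:11435/health

# List discovered models, OpenAI-style
curl -s http://127.0.0.1:11435/v1/models | jq .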
Sub-5MB single binary (142x smaller than Ollama)
GitHub stars climbing fast
<1s startup
100% Rust, no Python
Hacker News • Front Page Again • IPE Newsletter
Companies: Need invoicing? Email michaelallenkuykendall@gmail.com
| Tool | Binary Size | Startup Time | Memory Usage | OpenAI API |
|---|---|---|---|---|
| Shimmy | 4.8MB | <100ms | 50MB | 100% |
| Ollama | 680MB | 5-10s | 200MB+ | Partial |
| llama.cpp | 89MB | 1-2s | 100MB | Via llama-server |
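To sanity-check the size and startup claims on your own machine, a rough, unscientific check, assuming the downloaded binary is in the current directory:

# Binary size on disk
ls -lh ./shimmy

# Cold-start latency for a trivial command
time ./shimmy list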
Shimmy maintains high code quality through comprehensive testing:
Run the complete test suite:
# Using cargo aliases
cargo test-quick # Quick development tests
# Using Makefile
make test # Full test suite
make test-quick # Quick development tests
See our testing approach for technical details.
MIT License - forever and always.
Philosophy: Infrastructure should be invisible. Shimmy is infrastructure.
Testing Philosophy: Reliability through comprehensive validation and property-based testing.
Forever maintainer: Michael A. Kuykendall
Promise: This will never become a paid product
Mission: Making local model inference simple and reliable