Crates.io | ushi |
lib.rs | ushi |
version | 0.1.1 |
created_at | 2025-08-10 12:20:14.707496+00 |
updated_at | 2025-08-10 12:23:04.918261+00 |
description | High-performance LLM inference server with llama.cpp FFI bindings |
homepage | https://github.com/evil-mind-evil-sword/ushi |
repository | https://github.com/evil-mind-evil-sword/ushi |
max_upload_size | |
id | 1788861 |
size | 790,345 |
Production-grade LLM inference server built in Rust with llama.cpp FFI bindings. Optimized for high throughput and low latency.
Ushi requires llama.cpp to be installed on your system. The build process will automatically detect common installation paths.
brew install llama.cpp
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DBUILD_SHARED_LIBS=ON
cmake --build build --config Release
sudo cmake --install build
# Install to a custom location
cmake -B build -DCMAKE_INSTALL_PREFIX=$HOME/.local -DBUILD_SHARED_LIBS=ON
cmake --build build --config Release
cmake --install build
# Then set LLAMA_CPP_PATH when building Ushi
export LLAMA_CPP_PATH=$HOME/.local
cargo install ushi
git clone https://github.com/evil-mind-evil-sword/ushi.git
cd ushi
cargo build --release
The build process automatically searches for llama.cpp in:
- /usr/local (default for source builds)
- /usr (system packages)
- /opt/homebrew (Homebrew on Apple Silicon)
- /opt/local (MacPorts)
- $HOME/.local (user installations)
- pkg-config (if available)
- $LLAMA_CPP_PATH environment variable (if set)
cargo run --release
# Server starts on http://localhost:8080
# 1. Browse available models from HuggingFace
curl "http://localhost:8080/v1/models/huggingface?search=llama&limit=5"
# 2. Download a model from HuggingFace (using repo_id and filename)
curl -X POST http://localhost:8080/v1/models/download \
-H "Content-Type: application/json" \
-d '{
"repo_id": "TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF",
"filename": "tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf"
}'
# 3. Load the model (use the filename without .gguf extension as model_id)
curl -X POST http://localhost:8080/v1/models/load \
-H "Content-Type: application/json" \
-d '{
"model_path": "models/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf",
"model_id": "tinyllama"
}'
# 4. Chat with the model
curl -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "tinyllama",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Write a haiku about Rust programming."}
],
"max_tokens": 100,
"temperature": 0.7
}'
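The same quick-start flow can be scripted from Python. A minimal sketch, assuming the search and download endpoints shown above and the requests package; the shape of the search response is not documented here, so it is just printed:
import requests

BASE = "http://localhost:8080"

# Step 1: browse HuggingFace for models matching "llama"
search = requests.get(f"{BASE}/v1/models/huggingface",
                      params={"search": "llama", "limit": 5})
search.raise_for_status()
print(search.json())

# Step 2: download a specific GGUF file by repo_id and filename
download = requests.post(f"{BASE}/v1/models/download", json={
    "repo_id": "TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF",
    "filename": "tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf",
})
download.raise_for_status()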
curl -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "tinyllama",
"messages": [{"role": "user", "content": "Hello!"}],
"max_tokens": 100,
"temperature": 0.7,
"stream": false
}'
curl -X POST http://localhost:8080/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "tinyllama",
"prompt": "The meaning of life is",
"max_tokens": 50
}'
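The same completion request from Python. A sketch: reading the generated text from a "text" field on the first choice follows the usual OpenAI-style response shape and is an assumption here:
import requests

resp = requests.post(
    "http://localhost:8080/v1/completions",
    json={"model": "tinyllama", "prompt": "The meaning of life is", "max_tokens": 50},
)
resp.raise_for_status()
print(resp.json()["choices"][0].get("text", ""))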
# List loaded models
curl http://localhost:8080/v1/models
# List local model files
curl http://localhost:8080/v1/models/catalog
# Browse HuggingFace models
curl "http://localhost:8080/v1/models/huggingface?author=TheBloke&limit=10"
# Load a model with GPU acceleration
curl -X POST http://localhost:8080/v1/models/load \
-H "Content-Type: application/json" \
-d '{
"model_path": "models/llama-2-7b.Q4_K_M.gguf",
"model_id": "llama2",
"n_gpu_layers": 35
}'
# Unload a model
curl -X POST http://localhost:8080/v1/models/unload \
-H "Content-Type: application/json" \
-d '{"model_id": "llama2"}'
# Load multiple models simultaneously
curl -X POST http://localhost:8080/v1/models/load \
-H "Content-Type: application/json" \
-d '{"model_path": "models/llama.gguf", "model_id": "llama"}'
curl -X POST http://localhost:8080/v1/models/load \
-H "Content-Type: application/json" \
-d '{"model_path": "models/mistral.gguf", "model_id": "mistral"}'
# Use specific model in requests
curl -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "mistral", "messages": [{"role": "user", "content": "Hello"}]}'
import requests
import json
# Enable streaming with stream=true or use /stream endpoint
response = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "tinyllama",
        "messages": [{"role": "user", "content": "Tell me a story"}],
        "stream": True,
        "max_tokens": 200
    },
    stream=True
)

for line in response.iter_lines():
    if line:
        # Each line is a JSON object prefixed with "data: "
        if line.startswith(b"data: "):
            data = json.loads(line[6:])
            if data.get("choices"):
                print(data["choices"][0]["delta"].get("content", ""), end="")
# Optimize for throughput (more GPU layers)
curl -X POST http://localhost:8080/v1/models/load \
-H "Content-Type: application/json" \
-d '{
"model_path": "models/model.gguf",
"model_id": "fast-model",
"n_gpu_layers": 35,
"context_size": 2048
}'
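A rough client-side check that the tuning helps: time a handful of identical requests and report the average wall-clock latency. This measures end-to-end request time only, not server-side token throughput:
import time
import requests

payload = {
    "model": "fast-model",
    "messages": [{"role": "user", "content": "Say hello."}],
    "max_tokens": 32,
}

latencies = []
for _ in range(10):
    start = time.perf_counter()
    requests.post("http://localhost:8080/v1/chat/completions", json=payload).raise_for_status()
    latencies.append(time.perf_counter() - start)

print(f"average latency: {sum(latencies) / len(latencies):.3f}s over {len(latencies)} requests")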
# Check server health and metrics
curl http://localhost:8080/admin/health
curl http://localhost:8080/admin/metrics
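For scripts that launch the server and then wait for it, a simple readiness poll against /admin/health; this assumes only that the endpoint returns HTTP 200 once the server is up:
import time
import requests

def wait_for_server(url="http://localhost:8080/admin/health", timeout=60):
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            if requests.get(url, timeout=2).status_code == 200:
                return True
        except requests.ConnectionError:
            pass
        time.sleep(1)
    return False

print("server ready" if wait_for_server() else "server did not come up in time")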
This example shows how to interact with a running Ushi server from Python using HTTP requests:
# First, ensure the Ushi server is running:
# cargo run --release
import requests
import json
class UshiClient:
    def __init__(self, base_url="http://localhost:8080"):
        self.base_url = base_url
        self.session = requests.Session()

    def load_model(self, model_path, model_id, **kwargs):
        """Load a model from disk"""
        return self.session.post(
            f"{self.base_url}/v1/models/load",
            json={"model_path": model_path, "model_id": model_id, **kwargs}
        ).json()

    def chat(self, messages, model="tinyllama", **kwargs):
        """Send a chat completion request"""
        return self.session.post(
            f"{self.base_url}/v1/chat/completions",
            json={"model": model, "messages": messages, **kwargs}
        ).json()

    def list_models(self):
        """List all loaded models"""
        return self.session.get(f"{self.base_url}/v1/models").json()

# Usage
client = UshiClient()

# Load a model
client.load_model("models/tinyllama.gguf", "tinyllama", n_gpu_layers=20)

# Chat
response = client.chat([
    {"role": "user", "content": "What is the capital of France?"}
], max_tokens=50)
print(response["choices"][0]["message"]["content"])
# Logging
RUST_LOG=info # Set log level (debug, info, warn, error)
# Performance
OMP_NUM_THREADS=8 # CPU threads for inference
USHI__BATCH__MAX_SIZE=32 # Maximum batch size
# Tracing (optional)
OTEL_ENABLED=true # Enable OpenTelemetry
OTEL_ENDPOINT=http://localhost:4317
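These are ordinary environment variables, so they can be exported in the shell before cargo run or injected programmatically. A sketch that launches the server from Python with overrides, assuming it is run from the ushi checkout:
import os
import subprocess

env = os.environ.copy()
env.update({
    "RUST_LOG": "info",
    "OMP_NUM_THREADS": "8",
    "USHI__BATCH__MAX_SIZE": "32",
})

# Start the server with the overridden settings
server = subprocess.Popen(["cargo", "run", "--release"], env=env)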
Create config.toml:
[server]
host = "0.0.0.0"
port = 8080
[models]
model_dir = "./models"
default_context_size = 2048
[batch]
max_batch_size = 32
timeout_ms = 100
[cache]
max_entries = 1000
ttl_seconds = 3600
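Before starting the server, the file can be sanity-checked locally with the stdlib tomllib (Python 3.11+). This only verifies that the TOML parses and echoes the values above; it does not validate Ushi's actual configuration schema:
import tomllib

with open("config.toml", "rb") as f:
    config = tomllib.load(f)

print("server:", config["server"]["host"], config["server"]["port"])
print("batch: max", config["batch"]["max_batch_size"], "timeout_ms", config["batch"]["timeout_ms"])
print("cache: max_entries", config["cache"]["max_entries"], "ttl_seconds", config["cache"]["ttl_seconds"])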
# Run tests
cargo test --lib # Unit tests
cargo test --test integration # Integration tests
# Code quality
cargo xtask quality # Full quality report
cargo xtask fmt # Format code
cargo clippy # Linting
# Benchmarks
cargo bench # Run performance benchmarks
# Documentation
cargo doc --open # Generate and open API docs
ushi/
├── src/
│ ├── api/ # REST API handlers and routing
│ ├── batch/ # Request batching and queueing
│ ├── cache/ # KV cache and prompt caching
│ ├── ffi/ # llama.cpp FFI bindings
│ ├── generation/ # Token generation pipeline
│ ├── models/ # Model management and registry
│ └── server/ # Server initialization
├── tests/ # Integration tests
├── benches/ # Performance benchmarks
└── docs/ # Additional documentation
If the build fails with "llama.cpp not found", ensure:
- llama.cpp was built with shared libraries enabled (-DBUILD_SHARED_LIBS=ON)
- The install provides lib/libllama.{dylib,so} and include/llama.h
- If it lives in a non-standard location, set LLAMA_CPP_PATH explicitly: export LLAMA_CPP_PATH=/path/to/llama.cpp
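A small diagnostic that mirrors this checklist, looking for the library and header under $LLAMA_CPP_PATH and the common prefixes listed earlier (a sketch; the build script's actual detection logic may differ):
import os
from pathlib import Path

prefixes = [os.environ.get("LLAMA_CPP_PATH"), "/usr/local", "/usr",
            "/opt/homebrew", "/opt/local", str(Path.home() / ".local")]

for prefix in filter(None, prefixes):
    lib_dir = Path(prefix) / "lib"
    header = Path(prefix) / "include" / "llama.h"
    libs = list(lib_dir.glob("libllama.*")) if lib_dir.is_dir() else []
    if libs and header.is_file():
        print(f"found llama.cpp under {prefix} ({libs[0].name}, include/llama.h)")
        break
else:
    print("llama.cpp not found; rebuild with -DBUILD_SHARED_LIBS=ON or set LLAMA_CPP_PATH")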
Set n_gpu_layers: 35 when loading models to enable GPU offload (adjust based on your GPU).
GPL-3.0-or-later. See LICENSE for details.
Built on: