Ushi

Production-grade LLM inference server built in Rust with llama.cpp FFI bindings. Optimized for high throughput and low latency.

Features

  • High Performance - Continuous batching, GPU acceleration, and optimized memory management
  • OpenAI Compatible - Drop-in replacement for OpenAI API endpoints
  • Model Management - Download from HuggingFace, hot-swap models, multi-model support
  • Hardware Acceleration - Metal (macOS) and CUDA (NVIDIA) support
  • Smart Caching - KV cache, paged attention, and semantic prompt caching
  • Real-time Streaming - Server-sent events for token-by-token generation

Quick Start

Prerequisites

Ushi requires llama.cpp to be installed on your system. The build process will automatically detect common installation paths.

Option 1: Install via Homebrew (macOS)

brew install llama.cpp

Option 2: Build from source

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DBUILD_SHARED_LIBS=ON
cmake --build build --config Release
sudo cmake --install build

Option 3: Custom installation

# Install to a custom location
cmake -B build -DCMAKE_INSTALL_PREFIX=$HOME/.local -DBUILD_SHARED_LIBS=ON
cmake --build build --config Release
cmake --install build

# Then set LLAMA_CPP_PATH when building Ushi
export LLAMA_CPP_PATH=$HOME/.local

Installation

From crates.io (coming soon)

cargo install ushi

From source

git clone https://github.com/evil-mind-evil-sword/ushi.git
cd ushi
cargo build --release

The build process automatically searches for llama.cpp in:

  • /usr/local (default for source builds)
  • /usr (system packages)
  • /opt/homebrew (Homebrew on Apple Silicon)
  • /opt/local (MacPorts)
  • $HOME/.local (user installations)
  • Via pkg-config if available
  • $LLAMA_CPP_PATH environment variable (if set)

Requirements:

  • Rust 1.75+
  • llama.cpp library (see prerequisites above)
  • C++ compiler (for bindgen)

Start the Server

cargo run --release
# Server starts on http://localhost:8080
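
Once the server reports it is listening, a quick request against the health endpoint (documented under Performance Tuning below) confirms it is reachable. A minimal sketch in Python, assuming the default local address:

import requests

# The /admin/health endpoint is listed in the Performance Tuning section
resp = requests.get("http://localhost:8080/admin/health", timeout=5)
print(resp.status_code, resp.text)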

Download and Use a Model

# 1. Browse available models from HuggingFace
curl "http://localhost:8080/v1/models/huggingface?search=llama&limit=5"

# 2. Download a model from HuggingFace (using repo_id and filename)
curl -X POST http://localhost:8080/v1/models/download \
  -H "Content-Type: application/json" \
  -d '{
    "repo_id": "TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF",
    "filename": "tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf"
  }'

# 3. Load the model (model_id is the short name you will reference in requests)
curl -X POST http://localhost:8080/v1/models/load \
  -H "Content-Type: application/json" \
  -d '{
    "model_path": "models/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf",
    "model_id": "tinyllama"
  }'

# 4. Chat with the model
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tinyllama",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Write a haiku about Rust programming."}
    ],
    "max_tokens": 100,
    "temperature": 0.7
  }'

Core API Endpoints

Chat Completions (OpenAI Compatible)

curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tinyllama",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 100,
    "temperature": 0.7,
    "stream": false
  }'
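
Because the endpoints are OpenAI-compatible, the official openai Python package can also be pointed at the server by overriding its base URL. A sketch, assuming openai >= 1.0 is installed and that the server does not require an API key (any placeholder value satisfies the client constructor):

from openai import OpenAI

# Point the official client at the local Ushi server instead of api.openai.com
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="tinyllama",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=100,
    temperature=0.7,
)
print(resp.choices[0].message.content)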

Text Completions

curl -X POST http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tinyllama",
    "prompt": "The meaning of life is",
    "max_tokens": 50
  }'

Model Management

# List loaded models
curl http://localhost:8080/v1/models

# List local model files
curl http://localhost:8080/v1/models/catalog

# Browse HuggingFace models
curl "http://localhost:8080/v1/models/huggingface?author=TheBloke&limit=10"

# Load a model with GPU acceleration
curl -X POST http://localhost:8080/v1/models/load \
  -H "Content-Type: application/json" \
  -d '{
    "model_path": "models/llama-2-7b.Q4_K_M.gguf",
    "model_id": "llama2",
    "n_gpu_layers": 35
  }'

# Unload a model
curl -X POST http://localhost:8080/v1/models/unload \
  -H "Content-Type: application/json" \
  -d '{"model_id": "llama2"}'

Advanced Usage

Running Multiple Models

# Load multiple models simultaneously
curl -X POST http://localhost:8080/v1/models/load \
  -H "Content-Type: application/json" \
  -d '{"model_path": "models/llama.gguf", "model_id": "llama"}'

curl -X POST http://localhost:8080/v1/models/load \
  -H "Content-Type: application/json" \
  -d '{"model_path": "models/mistral.gguf", "model_id": "mistral"}'

# Use specific model in requests
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "mistral", "messages": [{"role": "user", "content": "Hello"}]}'

Streaming Responses

import requests
import json

# Enable streaming with stream=true or use /stream endpoint
response = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "tinyllama",
        "messages": [{"role": "user", "content": "Tell me a story"}],
        "stream": True,
        "max_tokens": 200
    },
    stream=True
)

for line in response.iter_lines():
    # Each SSE line is a JSON object prefixed with "data: "; blank keep-alive lines are skipped
    if line.startswith(b"data: "):
        payload = line[6:]
        # OpenAI-compatible streams usually terminate with a "data: [DONE]" sentinel
        if payload.strip() == b"[DONE]":
            break
        data = json.loads(payload)
        if data.get("choices"):
            print(data["choices"][0]["delta"].get("content", ""), end="", flush=True)

Performance Tuning

# Optimize for throughput (more GPU layers)
curl -X POST http://localhost:8080/v1/models/load \
  -H "Content-Type: application/json" \
  -d '{
    "model_path": "models/model.gguf",
    "model_id": "fast-model",
    "n_gpu_layers": 35,
    "context_size": 2048
  }'

# Check server health and metrics
curl http://localhost:8080/admin/health
curl http://localhost:8080/admin/metrics
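
To judge whether a tuning change helped, a rough latency check against the completions endpoint is often enough. A sketch; the throughput figure assumes the response carries an OpenAI-style usage block, as hedged in the code:

import time
import requests

# Time one non-streaming completion against the tuned model loaded above
payload = {"model": "fast-model", "prompt": "The meaning of life is", "max_tokens": 50}
start = time.perf_counter()
resp = requests.post("http://localhost:8080/v1/completions", json=payload)
elapsed = time.perf_counter() - start
print(f"latency: {elapsed:.2f}s")

# If the response includes an OpenAI-style usage block, derive rough throughput
tokens = resp.json().get("usage", {}).get("completion_tokens")
if tokens:
    print(f"throughput: ~{tokens / elapsed:.1f} tokens/s")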

Python Client Example

This example shows how to interact with a running Ushi server from Python using HTTP requests:

# First, ensure the Ushi server is running:
# cargo run --release

import requests
import json

class UshiClient:
    def __init__(self, base_url="http://localhost:8080"):
        self.base_url = base_url
        self.session = requests.Session()
    
    def load_model(self, model_path, model_id, **kwargs):
        """Load a model from disk"""
        return self.session.post(
            f"{self.base_url}/v1/models/load",
            json={"model_path": model_path, "model_id": model_id, **kwargs}
        ).json()
    
    def chat(self, messages, model="tinyllama", **kwargs):
        """Send a chat completion request"""
        return self.session.post(
            f"{self.base_url}/v1/chat/completions",
            json={"model": model, "messages": messages, **kwargs}
        ).json()
    
    def list_models(self):
        """List all loaded models"""
        return self.session.get(f"{self.base_url}/v1/models").json()

# Usage
client = UshiClient()

# Load a model
client.load_model("models/tinyllama.gguf", "tinyllama", n_gpu_layers=20)

# Chat
response = client.chat([
    {"role": "user", "content": "What is the capital of France?"}
], max_tokens=50)

print(response["choices"][0]["message"]["content"])

Configuration

Environment Variables

# Logging
RUST_LOG=info                    # Set log level (debug, info, warn, error)

# Performance
OMP_NUM_THREADS=8               # CPU threads for inference
USHI__BATCH__MAX_SIZE=32        # Maximum batch size

# Tracing (optional)
OTEL_ENABLED=true               # Enable OpenTelemetry
OTEL_ENDPOINT=http://localhost:4317

Configuration File

Create config.toml:

[server]
host = "0.0.0.0"
port = 8080

[models]
model_dir = "./models"
default_context_size = 2048

[batch]
max_batch_size = 32
timeout_ms = 100

[cache]
max_entries = 1000
ttl_seconds = 3600

Development

# Run tests
cargo test --lib                # Unit tests
cargo test --test integration   # Integration tests

# Code quality
cargo xtask quality            # Full quality report
cargo xtask fmt                # Format code
cargo clippy                   # Linting

# Benchmarks
cargo bench                    # Run performance benchmarks

# Documentation
cargo doc --open              # Generate and open API docs

Project Structure

ushi/
├── src/
│   ├── api/          # REST API handlers and routing
│   ├── batch/        # Request batching and queueing
│   ├── cache/        # KV cache and prompt caching
│   ├── ffi/          # llama.cpp FFI bindings
│   ├── generation/   # Token generation pipeline
│   ├── models/       # Model management and registry
│   └── server/       # Server initialization
├── tests/            # Integration tests
├── benches/          # Performance benchmarks
└── docs/             # Additional documentation

Troubleshooting

llama.cpp not found during build

If the build fails with "llama.cpp not found", check the following (a quick verification script is shown after this list):

  1. llama.cpp is installed with shared libraries (-DBUILD_SHARED_LIBS=ON)
  2. The installation path contains both lib/libllama.{dylib,so} and include/llama.h
  3. Or set LLAMA_CPP_PATH explicitly: export LLAMA_CPP_PATH=/path/to/llama.cpp
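
The sketch below checks the prefixes listed in the Installation section for the required library and header; adjust the extension list for your platform:

import os
from pathlib import Path

# Candidate prefixes mirror the search order listed in the Installation section,
# plus LLAMA_CPP_PATH if it is set.
prefixes = ["/usr/local", "/usr", "/opt/homebrew", "/opt/local",
            os.path.expanduser("~/.local"), os.environ.get("LLAMA_CPP_PATH", "")]

for prefix in filter(None, prefixes):
    root = Path(prefix)
    has_lib = any((root / "lib" / f"libllama.{ext}").exists() for ext in ("dylib", "so"))
    has_header = (root / "include" / "llama.h").exists()
    print(f"{prefix}: libllama={'yes' if has_lib else 'no'}, llama.h={'yes' if has_header else 'no'}")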

Model won't load

  • Ensure the model file is in GGUF format
  • Check available memory (use smaller quantization if needed)
  • Verify the model path is correct

Low performance

  • Enable GPU layers: n_gpu_layers: 35 (adjust to fit your GPU memory)
  • Use quantized models (Q4_K_M recommended)
  • Reduce context size if not needed

Out of memory

  • Use smaller quantization (Q4_K_M instead of Q8_0)
  • Reduce batch size in configuration
  • Offload fewer layers to GPU

License

GPL-3.0-or-later. See LICENSE for details.

Acknowledgments

Built on:

  • llama.cpp - High-performance C++ inference
  • Axum - Ergonomic web framework
  • Tokio - Asynchronous runtime
