Ushi

Production-grade LLM inference server built in Rust with llama.cpp FFI bindings. Optimized for high throughput and low latency.

Features

  • High Performance - Continuous batching, GPU acceleration, and optimized memory management
  • OpenAI Compatible - Drop-in replacement for OpenAI API endpoints
  • Model Management - Download from HuggingFace, hot-swap models, multi-model support
  • Hardware Acceleration - Metal (macOS) and CUDA (NVIDIA) support
  • Smart Caching - KV cache, paged attention, and semantic prompt caching
  • Real-time Streaming - Server-sent events for token-by-token generation

Quick Start

Prerequisites

Ushi requires llama.cpp to be installed on your system. The build process will automatically detect common installation paths.

Option 1: Install via Homebrew (macOS)

brew install llama.cpp

Option 2: Build from source

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DBUILD_SHARED_LIBS=ON
cmake --build build --config Release
sudo cmake --install build

Option 3: Custom installation

# Install to a custom location
cmake -B build -DCMAKE_INSTALL_PREFIX=$HOME/.local -DBUILD_SHARED_LIBS=ON
cmake --build build --config Release
cmake --install build

# Then set LLAMA_CPP_PATH when building Ushi
export LLAMA_CPP_PATH=$HOME/.local

Installation

From crates.io (coming soon)

cargo install ushi

From source

git clone https://github.com/evil-mind-evil-sword/ushi.git
cd ushi
cargo build --release

The build process automatically searches for llama.cpp in:

  • /usr/local (default for source builds)
  • /usr (system packages)
  • /opt/homebrew (Homebrew on Apple Silicon)
  • /opt/local (MacPorts)
  • $HOME/.local (user installations)
  • Via pkg-config if available
  • $LLAMA_CPP_PATH environment variable (if set)

Requirements:

  • Rust 1.75+
  • llama.cpp library (see prerequisites above)
  • C++ compiler (for bindgen)

Start the Server

cargo run --release
# Server starts on http://localhost:8080
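
Once the server reports it is listening, a quick request against the health endpoint (documented under Performance Tuning below) confirms it is reachable. A minimal sketch in Python, assuming the default local address:

import requests

# The /admin/health endpoint is listed in the Performance Tuning section
resp = requests.get("http://localhost:8080/admin/health", timeout=5)
print(resp.status_code, resp.text)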

Download and Use a Model

# 1. Browse available models from HuggingFace
curl "http://localhost:8080/v1/models/huggingface?search=llama&limit=5"

# 2. Download a model from HuggingFace (using repo_id and filename)
curl -X POST http://localhost:8080/v1/models/download \
  -H "Content-Type: application/json" \
  -d '{
    "repo_id": "TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF",
    "filename": "tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf"
  }'

# 3. Load the model (model_id is the short name you will reference in requests)
curl -X POST http://localhost:8080/v1/models/load \
  -H "Content-Type: application/json" \
  -d '{
    "model_path": "models/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf",
    "model_id": "tinyllama"
  }'

# 4. Chat with the model
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tinyllama",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Write a haiku about Rust programming."}
    ],
    "max_tokens": 100,
    "temperature": 0.7
  }'

Core API Endpoints

Chat Completions (OpenAI Compatible)

curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tinyllama",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 100,
    "temperature": 0.7,
    "stream": false
  }'
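
Because the endpoints are OpenAI-compatible, the official openai Python package can also be pointed at the server by overriding its base URL. A sketch, assuming openai >= 1.0 is installed and that the server does not require an API key (any placeholder value satisfies the client constructor):

from openai import OpenAI

# Point the official client at the local Ushi server instead of api.openai.com
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="tinyllama",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=100,
    temperature=0.7,
)
print(resp.choices[0].message.content)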

Text Completions

curl -X POST http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tinyllama",
    "prompt": "The meaning of life is",
    "max_tokens": 50
  }'

Model Management

# List loaded models
curl http://localhost:8080/v1/models

# List local model files
curl http://localhost:8080/v1/models/catalog

# Browse HuggingFace models
curl "http://localhost:8080/v1/models/huggingface?author=TheBloke&limit=10"

# Load a model with GPU acceleration
curl -X POST http://localhost:8080/v1/models/load \
  -H "Content-Type: application/json" \
  -d '{
    "model_path": "models/llama-2-7b.Q4_K_M.gguf",
    "model_id": "llama2",
    "n_gpu_layers": 35
  }'

# Unload a model
curl -X POST http://localhost:8080/v1/models/unload \
  -H "Content-Type: application/json" \
  -d '{"model_id": "llama2"}'

Advanced Usage

Running Multiple Models

# Load multiple models simultaneously
curl -X POST http://localhost:8080/v1/models/load \
  -H "Content-Type: application/json" \
  -d '{"model_path": "models/llama.gguf", "model_id": "llama"}'

curl -X POST http://localhost:8080/v1/models/load \
  -H "Content-Type: application/json" \
  -d '{"model_path": "models/mistral.gguf", "model_id": "mistral"}'

# Use specific model in requests
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "mistral", "messages": [{"role": "user", "content": "Hello"}]}'

Streaming Responses

import requests
import json

# Enable streaming with stream=true or use /stream endpoint
response = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "tinyllama",
        "messages": [{"role": "user", "content": "Tell me a story"}],
        "stream": True,
        "max_tokens": 200
    },
    stream=True
)

for line in response.iter_lines():
    # Each SSE line is a JSON object prefixed with "data: "; blank keep-alive lines are skipped
    if line.startswith(b"data: "):
        payload = line[6:]
        # OpenAI-compatible streams usually terminate with a "data: [DONE]" sentinel
        if payload.strip() == b"[DONE]":
            break
        data = json.loads(payload)
        if data.get("choices"):
            print(data["choices"][0]["delta"].get("content", ""), end="", flush=True)

Performance Tuning

# Optimize for throughput (more GPU layers)
curl -X POST http://localhost:8080/v1/models/load \
  -H "Content-Type: application/json" \
  -d '{
    "model_path": "models/model.gguf",
    "model_id": "fast-model",
    "n_gpu_layers": 35,
    "context_size": 2048
  }'

# Check server health and metrics
curl http://localhost:8080/admin/health
curl http://localhost:8080/admin/metrics
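
To judge whether a tuning change helped, a rough latency check against the completions endpoint is often enough. A sketch; the throughput figure assumes the response carries an OpenAI-style usage block, as hedged in the code:

import time
import requests

# Time one non-streaming completion against the tuned model loaded above
payload = {"model": "fast-model", "prompt": "The meaning of life is", "max_tokens": 50}
start = time.perf_counter()
resp = requests.post("http://localhost:8080/v1/completions", json=payload)
elapsed = time.perf_counter() - start
print(f"latency: {elapsed:.2f}s")

# If the response includes an OpenAI-style usage block, derive rough throughput
tokens = resp.json().get("usage", {}).get("completion_tokens")
if tokens:
    print(f"throughput: ~{tokens / elapsed:.1f} tokens/s")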

Python Client Example

This example shows how to interact with a running Ushi server from Python using HTTP requests:

# First, ensure the Ushi server is running:
# cargo run --release

import requests
import json

class UshiClient:
    def __init__(self, base_url="http://localhost:8080"):
        self.base_url = base_url
        self.session = requests.Session()
    
    def load_model(self, model_path, model_id, **kwargs):
        """Load a model from disk"""
        return self.session.post(
            f"{self.base_url}/v1/models/load",
            json={"model_path": model_path, "model_id": model_id, **kwargs}
        ).json()
    
    def chat(self, messages, model="tinyllama", **kwargs):
        """Send a chat completion request"""
        return self.session.post(
            f"{self.base_url}/v1/chat/completions",
            json={"model": model, "messages": messages, **kwargs}
        ).json()
    
    def list_models(self):
        """List all loaded models"""
        return self.session.get(f"{self.base_url}/v1/models").json()

# Usage
client = UshiClient()

# Load a model
client.load_model("models/tinyllama.gguf", "tinyllama", n_gpu_layers=20)

# Chat
response = client.chat([
    {"role": "user", "content": "What is the capital of France?"}
], max_tokens=50)

print(response["choices"][0]["message"]["content"])

Configuration

Environment Variables

# Logging
RUST_LOG=info                    # Set log level (debug, info, warn, error)

# Performance
OMP_NUM_THREADS=8               # CPU threads for inference
USHI__BATCH__MAX_SIZE=32        # Maximum batch size

# Tracing (optional)
OTEL_ENABLED=true               # Enable OpenTelemetry
OTEL_ENDPOINT=http://localhost:4317

Configuration File

Create config.toml:

[server]
host = "0.0.0.0"
port = 8080

[models]
model_dir = "./models"
default_context_size = 2048

[batch]
max_batch_size = 32
timeout_ms = 100

[cache]
max_entries = 1000
ttl_seconds = 3600

Development

# Run tests
cargo test --lib                # Unit tests
cargo test --test integration   # Integration tests

# Code quality
cargo xtask quality            # Full quality report
cargo xtask fmt                # Format code
cargo clippy                   # Linting

# Benchmarks
cargo bench                    # Run performance benchmarks

# Documentation
cargo doc --open              # Generate and open API docs

Project Structure

ushi/
├── src/
│   ├── api/          # REST API handlers and routing
│   ├── batch/        # Request batching and queueing
│   ├── cache/        # KV cache and prompt caching
│   ├── ffi/          # llama.cpp FFI bindings
│   ├── generation/   # Token generation pipeline
│   ├── models/       # Model management and registry
│   └── server/       # Server initialization
├── tests/            # Integration tests
├── benches/          # Performance benchmarks
└── docs/             # Additional documentation

Troubleshooting

llama.cpp not found during build

If the build fails with "llama.cpp not found", check the following (a quick verification script is shown after this list):

  1. llama.cpp is installed with shared libraries (-DBUILD_SHARED_LIBS=ON)
  2. The installation path contains both lib/libllama.{dylib,so} and include/llama.h
  3. Or set LLAMA_CPP_PATH explicitly: export LLAMA_CPP_PATH=/path/to/llama.cpp
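
The sketch below checks the prefixes listed in the Installation section for the required library and header; adjust the extension list for your platform:

import os
from pathlib import Path

# Candidate prefixes mirror the search order listed in the Installation section,
# plus LLAMA_CPP_PATH if it is set.
prefixes = ["/usr/local", "/usr", "/opt/homebrew", "/opt/local",
            os.path.expanduser("~/.local"), os.environ.get("LLAMA_CPP_PATH", "")]

for prefix in filter(None, prefixes):
    root = Path(prefix)
    has_lib = any((root / "lib" / f"libllama.{ext}").exists() for ext in ("dylib", "so"))
    has_header = (root / "include" / "llama.h").exists()
    print(f"{prefix}: libllama={'yes' if has_lib else 'no'}, llama.h={'yes' if has_header else 'no'}")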

Model won't load

  • Ensure the model file is in GGUF format
  • Check available memory (use smaller quantization if needed)
  • Verify the model path is correct

Low performance

  • Enable GPU layers: n_gpu_layers: 35 (adjust to fit your GPU memory)
  • Use quantized models (Q4_K_M recommended)
  • Reduce context size if not needed

Out of memory

  • Use smaller quantization (Q4_K_M instead of Q8_0)
  • Reduce batch size in configuration
  • Offload fewer layers to GPU

License

GPL-3.0-or-later. See LICENSE for details.

Acknowledgments

Built on:

  • llama.cpp - High-performance C++ inference
  • Axum - Ergonomic web framework
  • Tokio - Asynchronous runtime
