lmonade-server

Crate: lmonade-server
Version: 0.1.0-alpha.2
Created: 2025-08-20
Updated: 2025-08-21
Description: HTTP API server with OpenAI-compatible endpoints for the Lmonade LLM inference engine
Repository: https://jgok76.gitea.cloud/femtomc/lmonade
Size: 176,557
Owner: McCoy R. Becker (femtomc)

README

lmonade-server

OpenAI-compatible HTTP API server for the Lmonade LLM inference engine.

Overview

lmonade-server provides a production-ready HTTP server with endpoints compatible with OpenAI's API, enabling drop-in replacement for existing OpenAI integrations.

Features

  • OpenAI-compatible REST API
  • Real-time token streaming via Server-Sent Events (SSE)
  • Concurrent request handling
  • Health monitoring endpoints
  • CORS support for web applications
  • Comprehensive error handling

API Endpoints

Core Endpoints

Endpoint               Method   Description
/health                GET      Health check endpoint
/v1/models             GET      List available models
/v1/chat/completions   POST     Chat completion (OpenAI-compatible)
/v1/completions        POST     Text completion (OpenAI-compatible)
/v1/embeddings         POST     Generate embeddings (placeholder)

Request Examples

Chat Completion

curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "TinyLlama-1.1B-Chat-v1.0",
    "messages": [
      {"role": "user", "content": "Hello, how are you?"}
    ],
    "stream": false
  }'
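
Because the endpoint follows the OpenAI wire format, any HTTP client can call it. The sketch below makes the same request from Rust; it assumes client-side dependencies (reqwest with the json feature, serde_json, tokio) that are not part of lmonade-server itself:

use serde_json::json;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = reqwest::Client::new();

    // Same payload as the curl example above.
    let body = json!({
        "model": "TinyLlama-1.1B-Chat-v1.0",
        "messages": [{"role": "user", "content": "Hello, how are you?"}],
        "stream": false
    });

    let resp: serde_json::Value = client
        .post("http://localhost:8080/v1/chat/completions")
        .json(&body)
        .send()
        .await?
        .json()
        .await?;

    // OpenAI-compatible responses carry the text in choices[0].message.content.
    println!("{}", resp["choices"][0]["message"]["content"]);
    Ok(())
}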

Streaming Chat Completion

curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "TinyLlama-1.1B-Chat-v1.0",
    "messages": [
      {"role": "user", "content": "Tell me a story"}
    ],
    "stream": true
  }'
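
Consuming the stream from Rust can be done with reqwest's bytes_stream. This is a minimal sketch with naive SSE framing and assumed client-side dependencies (reqwest with the stream feature, futures-util, serde_json, tokio); the exact event payload and end-of-stream sentinel are defined by the server:

use futures_util::StreamExt;
use serde_json::json;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = reqwest::Client::new();
    let resp = client
        .post("http://localhost:8080/v1/chat/completions")
        .json(&json!({
            "model": "TinyLlama-1.1B-Chat-v1.0",
            "messages": [{"role": "user", "content": "Tell me a story"}],
            "stream": true
        }))
        .send()
        .await?;

    // Naive SSE handling: each chunk may contain one or more "data: ..." frames.
    // A production client should buffer partial frames across chunk boundaries.
    let mut body = resp.bytes_stream();
    while let Some(chunk) = body.next().await {
        for line in String::from_utf8_lossy(&chunk?).lines() {
            if let Some(data) = line.strip_prefix("data: ") {
                if data == "[DONE]" {
                    return Ok(()); // OpenAI-style end-of-stream sentinel, if emitted
                }
                println!("{data}");
            }
        }
    }
    Ok(())
}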

Configuration

Environment Variables

Variable                  Description                   Default
HOST                      Server bind address           127.0.0.1
PORT                      Server port                   8080
MODEL_NAME                Default model to load         TinyLlama-1.1B-Chat-v1.0
MAX_CONCURRENT_REQUESTS   Maximum concurrent requests   100
REQUEST_TIMEOUT_SECS      Request timeout in seconds    300
RUST_LOG                  Logging level                 info

Configuration File

Create a config.toml file:

[server]
host = "0.0.0.0"
port = 8080

[model]
name = "TinyLlama-1.1B-Chat-v1.0"
max_concurrent_requests = 100

[logging]
level = "info"
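
The layout above maps directly onto nested serde structs. A sketch of loading such a file follows; the struct and field names are illustrative and not necessarily the types used in the crate's config.rs (it assumes the serde and toml crates):

use serde::Deserialize;

// Hypothetical types mirroring the TOML layout above.
#[derive(Debug, Deserialize)]
struct Config {
    server: ServerConfig,
    model: ModelConfig,
    logging: LoggingConfig,
}

#[derive(Debug, Deserialize)]
struct ServerConfig { host: String, port: u16 }

#[derive(Debug, Deserialize)]
struct ModelConfig { name: String, max_concurrent_requests: usize }

#[derive(Debug, Deserialize)]
struct LoggingConfig { level: String }

fn load_config(path: &str) -> Result<Config, Box<dyn std::error::Error>> {
    let text = std::fs::read_to_string(path)?;
    Ok(toml::from_str(&text)?)
}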

Running the Server

Using Cargo

# Development
cargo run -- stand

# Release (optimized)
cargo run --release -- stand

Using Pre-built Binary

./lmonade stand

Docker

docker run -p 8080:8080 lmonade/server:latest

Architecture

The server is built on top of:

  • Axum: High-performance async web framework
  • Tokio: Async runtime
  • Tower: Middleware and service composition

Request Flow

  1. HTTP request received by Axum router
  2. Request validated and parsed
  3. Forwarded to LLMService
  4. LLMService communicates with ModelHub (actor system)
  5. Response streamed back to client
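
In axum terms, this flow boils down to a router whose handlers hold a shared service handle. The sketch below is illustrative only: LLMService is a stand-in for the crate's real service type, and the handler signatures may differ.

use std::sync::Arc;
use axum::{extract::State, routing::{get, post}, Json, Router};
use serde_json::{json, Value};

// Stand-in for the crate's LLMService; the real type wraps the ModelHub actor.
struct LLMService;

impl LLMService {
    async fn chat(&self, _req: Value) -> Value {
        json!({ "choices": [] }) // placeholder response
    }
}

async fn health() -> Json<Value> {
    Json(json!({ "status": "healthy" }))
}

// Steps 2-5 of the flow: parse the body, hand it to the service,
// return the (non-streaming) response as JSON.
async fn chat_completions(
    State(svc): State<Arc<LLMService>>,
    Json(req): Json<Value>,
) -> Json<Value> {
    Json(svc.chat(req).await)
}

fn build_router(service: Arc<LLMService>) -> Router {
    // Step 1: Axum matches the route and dispatches to the handler.
    Router::new()
        .route("/health", get(health))
        .route("/v1/chat/completions", post(chat_completions))
        .with_state(service)
}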

Streaming Implementation

The server implements true token-by-token streaming:

// SSE streaming for real-time generation: each token from the ModelHub
// is serialized to JSON and emitted as a Server-Sent Event.
let stream = hub.generate_stream(model, prompt, config).await?;
let sse_stream = stream.map(|token| {
    // json_data returns Result<Event, axum::Error>, which Sse::new accepts.
    Event::default().json_data(&token)
});
Sse::new(sse_stream)

Error Handling

The server provides detailed error responses:

{
  "error": {
    "message": "Model not found: gpt-4",
    "type": "invalid_request_error",
    "code": "model_not_found"
  }
}
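
One common way to produce this shape with axum is an error type that implements IntoResponse. The sketch below uses assumed names and covers only the model_not_found case; the crate's error.rs may be organized differently.

use axum::{http::StatusCode, response::{IntoResponse, Response}, Json};
use serde_json::json;

// Illustrative error type; field values mirror the JSON shown above.
enum ApiError {
    ModelNotFound(String),
}

impl IntoResponse for ApiError {
    fn into_response(self) -> Response {
        let (status, message, code) = match self {
            ApiError::ModelNotFound(name) => (
                StatusCode::NOT_FOUND,
                format!("Model not found: {name}"),
                "model_not_found",
            ),
        };
        let body = Json(json!({
            "error": {
                "message": message,
                "type": "invalid_request_error",
                "code": code,
            }
        }));
        (status, body).into_response()
    }
}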

Monitoring

Health Check

curl http://localhost:8080/health

Response:

{
  "status": "healthy",
  "model": "TinyLlama-1.1B-Chat-v1.0",
  "uptime_seconds": 3600,
  "requests_processed": 1234
}
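
The uptime and request counters imply some shared state behind the handler. A minimal sketch of such a handler is shown below; the field names mirror the JSON above, but the crate's actual implementation may differ.

use std::sync::{Arc, atomic::{AtomicU64, Ordering}};
use std::time::Instant;
use axum::{extract::State, Json};
use serde_json::{json, Value};

// Illustrative server-side state backing the health response.
struct ServerStats {
    started_at: Instant,
    requests_processed: AtomicU64,
    model_name: String,
}

async fn health(State(stats): State<Arc<ServerStats>>) -> Json<Value> {
    Json(json!({
        "status": "healthy",
        "model": stats.model_name,
        "uptime_seconds": stats.started_at.elapsed().as_secs(),
        "requests_processed": stats.requests_processed.load(Ordering::Relaxed),
    }))
}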

Metrics

The server exposes Prometheus-compatible metrics at /metrics (when enabled).

Development

Project Structure

lmonade-server/
├── src/
│   ├── api_handlers.rs    # HTTP request handlers
│   ├── llm_service.rs     # Core service logic
│   ├── routes.rs          # API route definitions
│   ├── config.rs          # Configuration management
│   ├── error.rs           # Error types
│   └── lib.rs             # Library exports
├── bin/
│   └── serve.rs           # Server binary
└── tests/
    └── integration.rs     # Integration tests

Adding New Endpoints

  1. Define handler in api_handlers.rs
  2. Add route in routes.rs
  3. Update OpenAPI spec if applicable
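
As a concrete illustration of steps 1 and 2, the sketch below adds a purely hypothetical /v1/tokenize endpoint; neither the endpoint nor the types exist in the crate.

use axum::{routing::post, Json, Router};
use serde::{Deserialize, Serialize};

// Step 1: the handler, which would live in api_handlers.rs.
#[derive(Deserialize)]
struct TokenizeRequest { text: String }

#[derive(Serialize)]
struct TokenizeResponse { tokens: Vec<String> }

async fn tokenize(Json(req): Json<TokenizeRequest>) -> Json<TokenizeResponse> {
    // Placeholder logic; a real handler would call into LLMService.
    Json(TokenizeResponse {
        tokens: req.text.split_whitespace().map(str::to_owned).collect(),
    })
}

// Step 2: register the route in routes.rs.
fn register(router: Router) -> Router {
    router.route("/v1/tokenize", post(tokenize))
}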

Testing

# Run tests
cargo test

# Integration tests
cargo test --test integration

# With logging
RUST_LOG=debug cargo test

Performance

Benchmarks

On a typical setup (figures are indicative and depend on hardware, model, and batch size):

  • Throughput: ~1000 tokens/second
  • Latency: <50ms first token
  • Concurrent requests: 100+

Optimization Tips

  1. Use release builds for production
  2. Enable GPU acceleration if available
  3. Adjust batch sizes based on hardware
  4. Use connection pooling for clients
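
On the client side, tip 4 usually amounts to constructing one HTTP client and reusing it; reqwest, for example, maintains a connection pool per Client instance. A minimal sketch (reqwest is an assumed client dependency, not part of this crate):

use std::sync::OnceLock;

// One shared client: reqwest pools connections per Client, so reusing it
// avoids a new TCP/TLS handshake on every request.
static CLIENT: OnceLock<reqwest::Client> = OnceLock::new();

fn http_client() -> &'static reqwest::Client {
    CLIENT.get_or_init(reqwest::Client::new)
}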

Troubleshooting

Common Issues

Port already in use:

# Change port
PORT=8081 ./lmonade stand

Model not loading:

# Check model path
ls ~/.lmonade/models/

Out of memory:

# Reduce batch size
MAX_BATCH_SIZE=8 ./lmonade stand

License

See LICENSE in the root directory.

Contributing

See CONTRIBUTING.md for guidelines.
