lmonade-server

Crate: lmonade-server
Version: 0.1.0-alpha.2
Created: 2025-08-20
Updated: 2025-08-21
Description: HTTP API server with OpenAI-compatible endpoints for the Lmonade LLM inference engine
Repository: https://jgok76.gitea.cloud/femtomc/lmonade
Size: 176,557
Owner: McCoy R. Becker (femtomc)

README

lmonade-server

OpenAI-compatible HTTP API server for the Lmonade LLM inference engine.

Overview

lmonade-server provides a production-ready HTTP server with endpoints compatible with OpenAI's API, enabling drop-in replacement for existing OpenAI integrations.

Features

  • OpenAI-compatible REST API
  • Real-time token streaming via Server-Sent Events (SSE)
  • Concurrent request handling
  • Health monitoring endpoints
  • CORS support for web applications
  • Comprehensive error handling

API Endpoints

Core Endpoints

Endpoint               Method   Description
/health                GET      Health check endpoint
/v1/models             GET      List available models
/v1/chat/completions   POST     Chat completion (OpenAI-compatible)
/v1/completions        POST     Text completion (OpenAI-compatible)
/v1/embeddings         POST     Generate embeddings (placeholder)

Request Examples

Chat Completion

curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "TinyLlama-1.1B-Chat-v1.0",
    "messages": [
      {"role": "user", "content": "Hello, how are you?"}
    ],
    "stream": false
  }'
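
Because the endpoint follows the OpenAI wire format, any HTTP client can call it. The sketch below makes the same request from Rust; it assumes client-side dependencies (reqwest with the json feature, serde_json, tokio) that are not part of lmonade-server itself:

use serde_json::json;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = reqwest::Client::new();

    // Same payload as the curl example above.
    let body = json!({
        "model": "TinyLlama-1.1B-Chat-v1.0",
        "messages": [{"role": "user", "content": "Hello, how are you?"}],
        "stream": false
    });

    let resp: serde_json::Value = client
        .post("http://localhost:8080/v1/chat/completions")
        .json(&body)
        .send()
        .await?
        .json()
        .await?;

    // OpenAI-compatible responses carry the text in choices[0].message.content.
    println!("{}", resp["choices"][0]["message"]["content"]);
    Ok(())
}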

Streaming Chat Completion

curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "TinyLlama-1.1B-Chat-v1.0",
    "messages": [
      {"role": "user", "content": "Tell me a story"}
    ],
    "stream": true
  }'
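
Consuming the stream from Rust can be done with reqwest's bytes_stream. This is a minimal sketch with naive SSE framing and assumed client-side dependencies (reqwest with the stream feature, futures-util, serde_json, tokio); the exact event payload and end-of-stream sentinel are defined by the server:

use futures_util::StreamExt;
use serde_json::json;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = reqwest::Client::new();
    let resp = client
        .post("http://localhost:8080/v1/chat/completions")
        .json(&json!({
            "model": "TinyLlama-1.1B-Chat-v1.0",
            "messages": [{"role": "user", "content": "Tell me a story"}],
            "stream": true
        }))
        .send()
        .await?;

    // Naive SSE handling: each chunk may contain one or more "data: ..." frames.
    // A production client should buffer partial frames across chunk boundaries.
    let mut body = resp.bytes_stream();
    while let Some(chunk) = body.next().await {
        for line in String::from_utf8_lossy(&chunk?).lines() {
            if let Some(data) = line.strip_prefix("data: ") {
                if data == "[DONE]" {
                    return Ok(()); // OpenAI-style end-of-stream sentinel, if emitted
                }
                println!("{data}");
            }
        }
    }
    Ok(())
}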

Configuration

Environment Variables

Variable                  Description                   Default
HOST                      Server bind address           127.0.0.1
PORT                      Server port                   8080
MODEL_NAME                Default model to load         TinyLlama-1.1B-Chat-v1.0
MAX_CONCURRENT_REQUESTS   Maximum concurrent requests   100
REQUEST_TIMEOUT_SECS      Request timeout in seconds    300
RUST_LOG                  Logging level                 info

Configuration File

Create a config.toml file:

[server]
host = "0.0.0.0"
port = 8080

[model]
name = "TinyLlama-1.1B-Chat-v1.0"
max_concurrent_requests = 100

[logging]
level = "info"
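
The layout above maps directly onto nested serde structs. A sketch of loading such a file follows; the struct and field names are illustrative and not necessarily the types used in the crate's config.rs (it assumes the serde and toml crates):

use serde::Deserialize;

// Hypothetical types mirroring the TOML layout above.
#[derive(Debug, Deserialize)]
struct Config {
    server: ServerConfig,
    model: ModelConfig,
    logging: LoggingConfig,
}

#[derive(Debug, Deserialize)]
struct ServerConfig { host: String, port: u16 }

#[derive(Debug, Deserialize)]
struct ModelConfig { name: String, max_concurrent_requests: usize }

#[derive(Debug, Deserialize)]
struct LoggingConfig { level: String }

fn load_config(path: &str) -> Result<Config, Box<dyn std::error::Error>> {
    let text = std::fs::read_to_string(path)?;
    Ok(toml::from_str(&text)?)
}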

Running the Server

Using Cargo

# Development
cargo run -- stand

# Release (optimized)
cargo run --release -- stand

Using Pre-built Binary

./lmonade stand

Docker

docker run -p 8080:8080 lmonade/server:latest

Architecture

The server is built on top of:

  • Axum: High-performance async web framework
  • Tokio: Async runtime
  • Tower: Middleware and service composition

Request Flow

  1. HTTP request received by Axum router
  2. Request validated and parsed
  3. Forwarded to LLMService
  4. LLMService communicates with ModelHub (actor system)
  5. Response streamed back to client
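
In axum terms, this flow boils down to a router whose handlers hold a shared service handle. The sketch below is illustrative only: LLMService is a stand-in for the crate's real service type, and the handler signatures may differ.

use std::sync::Arc;
use axum::{extract::State, routing::{get, post}, Json, Router};
use serde_json::{json, Value};

// Stand-in for the crate's LLMService; the real type wraps the ModelHub actor.
struct LLMService;

impl LLMService {
    async fn chat(&self, _req: Value) -> Value {
        json!({ "choices": [] }) // placeholder response
    }
}

async fn health() -> Json<Value> {
    Json(json!({ "status": "healthy" }))
}

// Steps 2-5 of the flow: parse the body, hand it to the service,
// return the (non-streaming) response as JSON.
async fn chat_completions(
    State(svc): State<Arc<LLMService>>,
    Json(req): Json<Value>,
) -> Json<Value> {
    Json(svc.chat(req).await)
}

fn build_router(service: Arc<LLMService>) -> Router {
    // Step 1: Axum matches the route and dispatches to the handler.
    Router::new()
        .route("/health", get(health))
        .route("/v1/chat/completions", post(chat_completions))
        .with_state(service)
}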

Streaming Implementation

The server implements true token-by-token streaming:

// SSE streaming for real-time generation: each token from the ModelHub
// is serialized to JSON and emitted as a Server-Sent Event.
let stream = hub.generate_stream(model, prompt, config).await?;
let sse_stream = stream.map(|token| {
    // json_data returns Result<Event, axum::Error>, which Sse::new accepts.
    Event::default().json_data(&token)
});
Sse::new(sse_stream)

Error Handling

The server provides detailed error responses:

{
  "error": {
    "message": "Model not found: gpt-4",
    "type": "invalid_request_error",
    "code": "model_not_found"
  }
}
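
One common way to produce this shape with axum is an error type that implements IntoResponse. The sketch below uses assumed names and covers only the model_not_found case; the crate's error.rs may be organized differently.

use axum::{http::StatusCode, response::{IntoResponse, Response}, Json};
use serde_json::json;

// Illustrative error type; field values mirror the JSON shown above.
enum ApiError {
    ModelNotFound(String),
}

impl IntoResponse for ApiError {
    fn into_response(self) -> Response {
        let (status, message, code) = match self {
            ApiError::ModelNotFound(name) => (
                StatusCode::NOT_FOUND,
                format!("Model not found: {name}"),
                "model_not_found",
            ),
        };
        let body = Json(json!({
            "error": {
                "message": message,
                "type": "invalid_request_error",
                "code": code,
            }
        }));
        (status, body).into_response()
    }
}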

Monitoring

Health Check

curl http://localhost:8080/health

Response:

{
  "status": "healthy",
  "model": "TinyLlama-1.1B-Chat-v1.0",
  "uptime_seconds": 3600,
  "requests_processed": 1234
}
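
The uptime and request counters imply some shared state behind the handler. A minimal sketch of such a handler is shown below; the field names mirror the JSON above, but the crate's actual implementation may differ.

use std::sync::{Arc, atomic::{AtomicU64, Ordering}};
use std::time::Instant;
use axum::{extract::State, Json};
use serde_json::{json, Value};

// Illustrative server-side state backing the health response.
struct ServerStats {
    started_at: Instant,
    requests_processed: AtomicU64,
    model_name: String,
}

async fn health(State(stats): State<Arc<ServerStats>>) -> Json<Value> {
    Json(json!({
        "status": "healthy",
        "model": stats.model_name,
        "uptime_seconds": stats.started_at.elapsed().as_secs(),
        "requests_processed": stats.requests_processed.load(Ordering::Relaxed),
    }))
}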

Metrics

The server exposes Prometheus-compatible metrics at /metrics (when enabled).

Development

Project Structure

lmonade-server/
├── src/
│   ├── api_handlers.rs    # HTTP request handlers
│   ├── llm_service.rs     # Core service logic
│   ├── routes.rs          # API route definitions
│   ├── config.rs          # Configuration management
│   ├── error.rs           # Error types
│   └── lib.rs             # Library exports
├── bin/
│   └── serve.rs           # Server binary
└── tests/
    └── integration.rs     # Integration tests

Adding New Endpoints

  1. Define handler in api_handlers.rs
  2. Add route in routes.rs
  3. Update OpenAPI spec if applicable
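
As a concrete illustration of steps 1 and 2, the sketch below adds a purely hypothetical /v1/tokenize endpoint; neither the endpoint nor the types exist in the crate.

use axum::{routing::post, Json, Router};
use serde::{Deserialize, Serialize};

// Step 1: the handler, which would live in api_handlers.rs.
#[derive(Deserialize)]
struct TokenizeRequest { text: String }

#[derive(Serialize)]
struct TokenizeResponse { tokens: Vec<String> }

async fn tokenize(Json(req): Json<TokenizeRequest>) -> Json<TokenizeResponse> {
    // Placeholder logic; a real handler would call into LLMService.
    Json(TokenizeResponse {
        tokens: req.text.split_whitespace().map(str::to_owned).collect(),
    })
}

// Step 2: register the route in routes.rs.
fn register(router: Router) -> Router {
    router.route("/v1/tokenize", post(tokenize))
}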

Testing

# Run tests
cargo test

# Integration tests
cargo test --test integration

# With logging
RUST_LOG=debug cargo test

Performance

Benchmarks

On a typical setup (figures are indicative and depend on hardware, model, and batch size):

  • Throughput: ~1000 tokens/second
  • Latency: <50ms first token
  • Concurrent requests: 100+

Optimization Tips

  1. Use release builds for production
  2. Enable GPU acceleration if available
  3. Adjust batch sizes based on hardware
  4. Use connection pooling for clients
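
On the client side, tip 4 usually amounts to constructing one HTTP client and reusing it; reqwest, for example, maintains a connection pool per Client instance. A minimal sketch (reqwest is an assumed client dependency, not part of this crate):

use std::sync::OnceLock;

// One shared client: reqwest pools connections per Client, so reusing it
// avoids a new TCP/TLS handshake on every request.
static CLIENT: OnceLock<reqwest::Client> = OnceLock::new();

fn http_client() -> &'static reqwest::Client {
    CLIENT.get_or_init(reqwest::Client::new)
}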

Troubleshooting

Common Issues

Port already in use:

# Change port
PORT=8081 ./lmonade stand

Model not loading:

# Check model path
ls ~/.lmonade/models/

Out of memory:

# Reduce batch size
MAX_BATCH_SIZE=8 ./lmonade stand

License

See LICENSE in the root directory.

Contributing

See CONTRIBUTING.md for guidelines.
