| Field | Value |
|---|---|
| Crates.io | lmonade-server |
| lib.rs | lmonade-server |
| version | 0.1.0-alpha.2 |
| created_at | 2025-08-20 14:12:14.661644+00 |
| updated_at | 2025-08-21 22:07:50.499023+00 |
| description | HTTP API server with OpenAI-compatible endpoints for the Lmonade LLM inference engine |
| homepage | |
| repository | https://jgok76.gitea.cloud/femtomc/lmonade |
| max_upload_size | |
| id | 1803471 |
| size | 176,557 |
# lmonade-server

OpenAI-compatible HTTP API server for the Lmonade LLM inference engine.

`lmonade-server` provides a production-ready HTTP server with endpoints compatible with OpenAI's API, enabling a drop-in replacement for existing OpenAI integrations.
The server exposes the following endpoints:

| Endpoint | Method | Description |
|---|---|---|
| `/health` | GET | Health check |
| `/v1/models` | GET | List available models |
| `/v1/chat/completions` | POST | Chat completion (OpenAI-compatible) |
| `/v1/completions` | POST | Text completion (OpenAI-compatible) |
| `/v1/embeddings` | POST | Generate embeddings (placeholder) |
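From Rust, the same endpoints can be called with any HTTP client. A minimal sketch that lists the loaded models, assuming a server on `localhost:8080` and the `tokio`, `serde_json`, and `reqwest` (with its `json` feature) crates:

```rust
use serde_json::Value;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // GET /v1/models returns an OpenAI-style model list.
    let models: Value = reqwest::get("http://localhost:8080/v1/models")
        .await?
        .json()
        .await?;
    println!("{}", serde_json::to_string_pretty(&models)?);
    Ok(())
}
```

The chat endpoints are shown below with curl; the same request bodies apply from any client.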
Non-streaming chat completion:

```bash
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "TinyLlama-1.1B-Chat-v1.0",
    "messages": [
      {"role": "user", "content": "Hello, how are you?"}
    ],
    "stream": false
  }'
```
Streaming chat completion (server-sent events):

```bash
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "TinyLlama-1.1B-Chat-v1.0",
    "messages": [
      {"role": "user", "content": "Tell me a story"}
    ],
    "stream": true
  }'
```
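On the client side, the streamed response can be consumed incrementally. A minimal sketch, assuming `tokio`, `futures-util`, `serde_json`, and `reqwest` with its `json` and `stream` features; it simply prints each `data:` line as it arrives (a robust client would buffer lines across chunk boundaries):

```rust
use futures_util::StreamExt;
use serde_json::json;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let response = reqwest::Client::new()
        .post("http://localhost:8080/v1/chat/completions")
        .json(&json!({
            "model": "TinyLlama-1.1B-Chat-v1.0",
            "messages": [{"role": "user", "content": "Tell me a story"}],
            "stream": true
        }))
        .send()
        .await?;

    // Scan each body chunk for SSE `data:` lines and print them as they arrive.
    let mut body = response.bytes_stream();
    while let Some(chunk) = body.next().await {
        let chunk = chunk?;
        for line in String::from_utf8_lossy(&chunk).lines() {
            if let Some(data) = line.strip_prefix("data: ") {
                println!("{data}");
            }
        }
    }
    Ok(())
}
```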
The server is configured through environment variables:

| Variable | Description | Default |
|---|---|---|
| `HOST` | Server bind address | `127.0.0.1` |
| `PORT` | Server port | `8080` |
| `MODEL_NAME` | Default model to load | `TinyLlama-1.1B-Chat-v1.0` |
| `MAX_CONCURRENT_REQUESTS` | Maximum concurrent requests | `100` |
| `REQUEST_TIMEOUT_SECS` | Request timeout in seconds | `300` |
| `RUST_LOG` | Logging level | `info` |
Create a `config.toml` file:

```toml
[server]
host = "0.0.0.0"
port = 8080

[model]
name = "TinyLlama-1.1B-Chat-v1.0"
max_concurrent_requests = 100

[logging]
level = "info"
```
To run the server:

```bash
# Development
cargo run -- serve

# Release (optimized)
cargo run --release -- serve

# Installed binary
./lmonade serve

# Docker
docker run -p 8080:8080 lmonade/server:latest
```
The server is built on top of `LLMService`, which communicates with `ModelHub` (the actor system).

The server implements true token-by-token streaming:
```rust
// SSE streaming for real-time generation
let stream = hub.generate_stream(model, prompt, config).await?;
let sse_stream = stream.map(|token| -> Result<Event, serde_json::Error> {
    Ok(Event::default().data(serde_json::to_string(&token)?))
});
```
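A stream of `Result<Event, _>` like this can be returned directly from an axum handler as an SSE response. A minimal, self-contained sketch of that wiring, assuming axum 0.7 with `tokio` and `futures-util`; the hub call is replaced by a hypothetical fixed token list, since the crate's internal types aren't shown here:

```rust
use std::convert::Infallible;

use axum::{
    response::sse::{Event, KeepAlive, Sse},
    routing::get,
    Router,
};
use futures_util::stream::{self, Stream};

// Emit a fixed token sequence as SSE events; in the real server the stream
// would come from the model hub instead.
async fn demo_stream() -> Sse<impl Stream<Item = Result<Event, Infallible>>> {
    let tokens = ["Hello", ", ", "world", "!"];
    let events = stream::iter(tokens.map(|t| Ok::<_, Infallible>(Event::default().data(t))));
    Sse::new(events).keep_alive(KeepAlive::default())
}

#[tokio::main]
async fn main() {
    let app = Router::new().route("/demo/stream", get(demo_stream));
    let listener = tokio::net::TcpListener::bind("127.0.0.1:8080").await.unwrap();
    axum::serve(listener, app).await.unwrap();
}
```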
The server provides detailed error responses:

```json
{
  "error": {
    "message": "Model not found: gpt-4",
    "type": "invalid_request_error",
    "code": "model_not_found"
  }
}
```
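The crate's `error.rs` isn't reproduced here; a minimal sketch of how an axum error type could render this shape, assuming axum and `serde_json` and using a hypothetical `ApiError` type:

```rust
use axum::{
    http::StatusCode,
    response::{IntoResponse, Response},
    Json,
};
use serde_json::json;

// Hypothetical error type producing the JSON shape shown above.
enum ApiError {
    ModelNotFound(String),
}

impl IntoResponse for ApiError {
    fn into_response(self) -> Response {
        let (status, message, code) = match self {
            ApiError::ModelNotFound(model) => (
                StatusCode::NOT_FOUND,
                format!("Model not found: {model}"),
                "model_not_found",
            ),
        };
        let body = json!({
            "error": {
                "message": message,
                "type": "invalid_request_error",
                "code": code
            }
        });
        (status, Json(body)).into_response()
    }
}

fn main() {
    // Build the response once just to show the resulting status code.
    let response = ApiError::ModelNotFound("gpt-4".into()).into_response();
    println!("{}", response.status());
}
```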
Health check:

```bash
curl http://localhost:8080/health
```

Response:

```json
{
  "status": "healthy",
  "model": "TinyLlama-1.1B-Chat-v1.0",
  "uptime_seconds": 3600,
  "requests_processed": 1234
}
```
The server exposes Prometheus-compatible metrics at `/metrics` (when enabled).
```text
lmonade-server/
├── src/
│   ├── api_handlers.rs   # HTTP request handlers
│   ├── llm_service.rs    # Core service logic
│   ├── routes.rs         # API route definitions
│   ├── config.rs         # Configuration management
│   ├── error.rs          # Error types
│   └── lib.rs            # Library exports
├── bin/
│   └── serve.rs          # Server binary
└── tests/
    └── integration.rs    # Integration tests
```
New endpoints are implemented as handlers in `api_handlers.rs` and registered in `routes.rs`, as sketched below.
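A minimal sketch of that pattern with axum, using a hypothetical `/version` handler (the crate's actual handler and router code may differ):

```rust
use axum::{routing::get, Json, Router};
use serde_json::{json, Value};

// Hypothetical handler; in this crate it would live in api_handlers.rs.
async fn version() -> Json<Value> {
    Json(json!({ "version": env!("CARGO_PKG_VERSION") }))
}

#[tokio::main]
async fn main() {
    // Route registration; in this crate it would happen in routes.rs.
    let app = Router::new().route("/version", get(version));
    let listener = tokio::net::TcpListener::bind("127.0.0.1:8080").await.unwrap();
    axum::serve(listener, app).await.unwrap();
}
```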
```bash
# Run tests
cargo test

# Integration tests
cargo test --test integration

# With logging
RUST_LOG=debug cargo test
```
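The contents of `tests/integration.rs` aren't shown here; a minimal sketch of an end-to-end check against `/health`, assuming `tokio`, `serde_json`, and `reqwest` (json feature) as dev-dependencies, marked `#[ignore]` because it needs a server already listening on port 8080:

```rust
// tests/integration.rs (illustrative sketch)

#[tokio::test]
#[ignore = "requires a running server on localhost:8080"]
async fn health_endpoint_reports_healthy() {
    let body: serde_json::Value = reqwest::get("http://localhost:8080/health")
        .await
        .expect("request failed")
        .json()
        .await
        .expect("invalid JSON");

    assert_eq!(body["status"], "healthy");
}
```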
Common issues:

**Port already in use:**

```bash
# Change port
PORT=8081 ./lmonade serve
```

**Model not loading:**

```bash
# Check model path
ls ~/.lmonade/models/
```

**Out of memory:**

```bash
# Reduce batch size
MAX_BATCH_SIZE=8 ./lmonade serve
```
See LICENSE in the root directory.
See CONTRIBUTING.md for guidelines.