lmonade

Crates.io: lmonade
lib.rs: lmonade
version: 0.1.0-alpha.2
created_at: 2025-08-20 14:13:31.939927+00
updated_at: 2025-08-21 22:08:24.779015+00
description: LLM inference engine - main crate with CLI and re-exports
repository: https://jgok76.gitea.cloud/femtomc/lmonade
id: 1803474
size: 152,861
owner: McCoy R. Becker (femtomc)

README

Lmonade

An LLM inference engine built in Rust with an actor-based architecture.

Quick Start

Installation

# Clone and build
git clone https://jgok76.gitea.cloud/femtomc/lmonade.git
cd lmonade
cargo build --release

# The CLI will be available at ./target/release/lmonade

CLI Usage

# Download a model
lmonade model download TinyLlama/TinyLlama-1.1B-Chat-v1.0

# Chat with the model
lmonade chat "Hello, how are you today?"

# Stream responses in real-time
lmonade chat --stream "Tell me a story about space"

# Use a specific model
lmonade chat --model TinyLlama-1.1B-Chat-v1.0 "Explain quantum computing"

# Start the API server (run a lmonade stand!)
lmonade stand

# List downloaded models
lmonade model list

# Show model information
lmonade model info TinyLlama-1.1B-Chat-v1.0

# Get help
lmonade --help
lmonade chat --help

API Server

Start the OpenAI-compatible API server:

# Start server (default port 8080)
lmonade stand

# Custom configuration
lmonade stand --host 0.0.0.0 --port 3000 --model TinyLlama-1.1B-Chat-v1.0

# Or build and run the binary directly
cargo build --release
./target/release/lmonade stand

Make requests to the API:

# Chat completion
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "TinyLlama-1.1B-Chat-v1.0",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
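
Assuming the server mirrors the OpenAI chat completion schema (which the compatibility claim implies), a successful response is shaped roughly as follows; the concrete values are illustrative only:

{
  "id": "chatcmpl-...",
  "object": "chat.completion",
  "model": "TinyLlama-1.1B-Chat-v1.0",
  "choices": [
    {
      "index": 0,
      "message": {"role": "assistant", "content": "Hello! How can I help you?"},
      "finish_reason": "stop"
    }
  ],
  "usage": {"prompt_tokens": 9, "completion_tokens": 9, "total_tokens": 18}
}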

# Streaming
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "TinyLlama-1.1B-Chat-v1.0",
    "messages": [{"role": "user", "content": "Tell me a joke"}],
    "stream": true
  }'
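
With "stream": true, OpenAI-compatible servers reply as server-sent events: each data: line carries a JSON chunk whose choices[0].delta holds the next piece of text, and the stream ends with data: [DONE]. Assuming lmonade follows that convention, the raw wire output looks roughly like:

data: {"object":"chat.completion.chunk","choices":[{"index":0,"delta":{"role":"assistant","content":"Why"}}]}

data: {"object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":" did"}}]}

data: [DONE]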

Features

  • Fast Inference - Optimized Rust implementation with GPU acceleration
  • Actor-Based - Fault-tolerant architecture with supervision trees
  • True Streaming - Real-time token-by-token generation
  • OpenAI Compatible - Drop-in replacement API
  • Easy to Use - Simple CLI interface
  • Extensible - Modular design for adding new models

Documentation

Guide            Description
Getting Started  Installation and first steps
CLI Guide        Complete CLI reference
API Reference    HTTP API documentation
Architecture     System design and internals
Development      Contributing and extending

Supported Models

Model                      Size    Status       Notes
TinyLlama-1.1B-Chat-v1.0   1.1B    Ready        Optimized for chat
Llama 2                    7B-70B  In Progress  Coming soon
Mistral                    7B      Planned      Q1 2025
Mixtral                    8x7B    Planned      MoE support
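
Because the server is OpenAI-compatible, a running instance should also report its models over the standard /v1/models endpoint; that endpoint is an assumption based on the compatibility claim rather than documented behavior. A minimal check with the openai Python package:

# Hedged sketch: ask a running `lmonade stand` server which models it serves,
# assuming it implements the standard OpenAI /v1/models endpoint.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

for model in client.models.list():
    print(model.id)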

Examples

Python Client

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="TinyLlama-1.1B-Chat-v1.0",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True
)

for chunk in response:
    delta = chunk.choices[0].delta.content
    if delta:  # the final chunk carries no content
        print(delta, end="", flush=True)

JavaScript/TypeScript

const response = await fetch('http://localhost:8080/v1/chat/completions', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
        model: 'TinyLlama-1.1B-Chat-v1.0',
        messages: [{ role: 'user', content: 'Hello!' }],
        stream: false
    })
});

const data = await response.json();
console.log(data.choices[0].message.content);

Performance

  • Throughput: ~1000 tokens/second on RTX 3090
  • First Token Latency: <50ms
  • Memory Efficient: KV cache with automatic management
  • Concurrent Requests: Efficient batching via actor system (see the sketch below)
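
To exercise that batching path, you can issue several requests at once. Below is a minimal sketch using the openai package's async client; it assumes a server started with lmonade stand is listening on localhost:8080, and the prompts and request count are arbitrary:

# Hedged sketch: send concurrent chat requests so the actor system can batch
# them. Assumes an OpenAI-compatible server is listening on localhost:8080.
import asyncio

from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

async def ask(prompt: str) -> str:
    response = await client.chat.completions.create(
        model="TinyLlama-1.1B-Chat-v1.0",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

async def main() -> None:
    prompts = [f"Give me one fact about lemons (#{i})." for i in range(8)]
    # gather() submits all requests at once; the server decides how to batch them.
    answers = await asyncio.gather(*(ask(p) for p in prompts))
    for answer in answers:
        print(answer)

asyncio.run(main())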

System Requirements

  • OS: Linux, macOS, Windows
  • RAM: 8GB minimum (16GB recommended)
  • GPU: NVIDIA (CUDA 11.8+) or Apple Silicon (Metal)
  • Disk: 10GB for models
  • Rust: 1.75 or later

Building from Source

# Development build
cargo build

# Optimized release build
cargo build --release

# Run tests
cargo test

# Run with debug logging
RUST_LOG=debug cargo run --bin lmonade chat "Hello"

Project Structure

lmonade/
├── lmonade/                 # CLI application
├── lmonade-models/          # Model implementations
├── lmonade-runtime/         # Actor-based runtime
├── lmonade-server/          # HTTP API server
└── docs/                    # Documentation
    ├── getting-started/     # Installation & setup
    ├── cli/                 # CLI documentation
    ├── api/                 # API reference
    ├── architecture/        # System design
    └── development/         # Developer guides
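
The top-level lmonade crate is also published to crates.io and, per its description, re-exports the other crates, so it can in principle be pulled in as a library dependency; the public API itself is not documented here. A Cargo.toml entry would look like:

[dependencies]
lmonade = "0.1.0-alpha.2"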

Contributing

We welcome contributions! See CONTRIBUTING.md for guidelines.

Key areas for contribution:

  • Model implementations
  • Performance optimizations
  • Documentation improvements
  • Test coverage
  • Internationalization

License

GPL v3.0 - See LICENSE for details.

Status: Beta - Core features working, optimizations ongoing
