lmonade

Crates.io: lmonade
lib.rs: lmonade
version: 0.1.0-alpha.2
created_at: 2025-08-20 14:13:31.939927+00
updated_at: 2025-08-21 22:08:24.779015+00
description: LLM inference engine - main crate with CLI and re-exports
repository: https://jgok76.gitea.cloud/femtomc/lmonade
id: 1803474
size: 152,861
owner: McCoy R. Becker (femtomc)

README

Lmonade

An LLM inference engine built in Rust with an actor-based architecture.

Quick Start

Installation

# Clone and build
git clone https://jgok76.gitea.cloud/femtomc/lmonade.git
cd lmonade
cargo build --release

# The CLI will be available at ./target/release/lmonade

CLI Usage

# Download a model
lmonade model download TinyLlama/TinyLlama-1.1B-Chat-v1.0

# Chat with the model
lmonade chat "Hello, how are you today?"

# Stream responses in real-time
lmonade chat --stream "Tell me a story about space"

# Use a specific model
lmonade chat --model TinyLlama-1.1B-Chat-v1.0 "Explain quantum computing"

# Start the API server (run a lmonade stand!)
lmonade stand

# List downloaded models
lmonade model list

# Show model information
lmonade model info TinyLlama-1.1B-Chat-v1.0

# Get help
lmonade --help
lmonade chat --help

API Server

Start the OpenAI-compatible API server:

# Start server (default port 8080)
lmonade stand

# Custom configuration
lmonade stand --host 0.0.0.0 --port 3000 --model TinyLlama-1.1B-Chat-v1.0

# Or build and run the binary directly
cargo build --release
./target/release/lmonade stand

Make requests to the API:

# Chat completion
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "TinyLlama-1.1B-Chat-v1.0",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
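
Assuming the server mirrors the OpenAI chat completion schema (which the compatibility claim implies), a successful response is shaped roughly as follows; the concrete values are illustrative only:

{
  "id": "chatcmpl-...",
  "object": "chat.completion",
  "model": "TinyLlama-1.1B-Chat-v1.0",
  "choices": [
    {
      "index": 0,
      "message": {"role": "assistant", "content": "Hello! How can I help you?"},
      "finish_reason": "stop"
    }
  ],
  "usage": {"prompt_tokens": 9, "completion_tokens": 9, "total_tokens": 18}
}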

# Streaming
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "TinyLlama-1.1B-Chat-v1.0",
    "messages": [{"role": "user", "content": "Tell me a joke"}],
    "stream": true
  }'
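
With "stream": true, OpenAI-compatible servers reply as server-sent events: each data: line carries a JSON chunk whose choices[0].delta holds the next piece of text, and the stream ends with data: [DONE]. Assuming lmonade follows that convention, the raw wire output looks roughly like:

data: {"object":"chat.completion.chunk","choices":[{"index":0,"delta":{"role":"assistant","content":"Why"}}]}

data: {"object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":" did"}}]}

data: [DONE]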

Features

  • Fast Inference - Optimized Rust implementation with GPU acceleration
  • Actor-Based - Fault-tolerant architecture with supervision trees
  • True Streaming - Real-time token-by-token generation
  • OpenAI Compatible - Drop-in replacement API
  • Easy to Use - Simple CLI interface
  • Extensible - Modular design for adding new models

Documentation

Guide            Description
Getting Started  Installation and first steps
CLI Guide        Complete CLI reference
API Reference    HTTP API documentation
Architecture     System design and internals
Development      Contributing and extending

Supported Models

Model                      Size    Status       Notes
TinyLlama-1.1B-Chat-v1.0   1.1B    Ready        Optimized for chat
Llama 2                    7B-70B  In Progress  Coming soon
Mistral                    7B      Planned      Q1 2025
Mixtral                    8x7B    Planned      MoE support
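
Because the server is OpenAI-compatible, a running instance should also report its models over the standard /v1/models endpoint; that endpoint is an assumption based on the compatibility claim rather than documented behavior. A minimal check with the openai Python package:

# Hedged sketch: ask a running `lmonade stand` server which models it serves,
# assuming it implements the standard OpenAI /v1/models endpoint.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

for model in client.models.list():
    print(model.id)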

Examples

Python Client

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="TinyLlama-1.1B-Chat-v1.0",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True
)

for chunk in response:
    delta = chunk.choices[0].delta.content
    if delta:  # the final chunk carries no content
        print(delta, end="", flush=True)

JavaScript/TypeScript

const response = await fetch('http://localhost:8080/v1/chat/completions', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
        model: 'TinyLlama-1.1B-Chat-v1.0',
        messages: [{ role: 'user', content: 'Hello!' }],
        stream: false
    })
});

const data = await response.json();
console.log(data.choices[0].message.content);

Performance

  • Throughput: ~1000 tokens/second on RTX 3090
  • First Token Latency: <50ms
  • Memory Efficient: KV cache with automatic management
  • Concurrent Requests: Efficient batching via actor system (see the sketch below)
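
To exercise that batching path, you can issue several requests at once. Below is a minimal sketch using the openai package's async client; it assumes a server started with lmonade stand is listening on localhost:8080, and the prompts and request count are arbitrary:

# Hedged sketch: send concurrent chat requests so the actor system can batch
# them. Assumes an OpenAI-compatible server is listening on localhost:8080.
import asyncio

from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

async def ask(prompt: str) -> str:
    response = await client.chat.completions.create(
        model="TinyLlama-1.1B-Chat-v1.0",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

async def main() -> None:
    prompts = [f"Give me one fact about lemons (#{i})." for i in range(8)]
    # gather() submits all requests at once; the server decides how to batch them.
    answers = await asyncio.gather(*(ask(p) for p in prompts))
    for answer in answers:
        print(answer)

asyncio.run(main())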

System Requirements

  • OS: Linux, macOS, Windows
  • RAM: 8GB minimum (16GB recommended)
  • GPU: NVIDIA (CUDA 11.8+) or Apple Silicon (Metal)
  • Disk: 10GB for models
  • Rust: 1.75 or later

Building from Source

# Development build
cargo build

# Optimized release build
cargo build --release

# Run tests
cargo test

# Run with debug logging
RUST_LOG=debug cargo run --bin lmonade chat "Hello"

Project Structure

lmonade/
├── lmonade/                 # CLI application
├── lmonade-models/          # Model implementations
├── lmonade-runtime/         # Actor-based runtime
├── lmonade-server/          # HTTP API server
└── docs/                    # Documentation
    ├── getting-started/     # Installation & setup
    ├── cli/                 # CLI documentation
    ├── api/                 # API reference
    ├── architecture/        # System design
    └── development/         # Developer guides
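
The top-level lmonade crate is also published to crates.io and, per its description, re-exports the other crates, so it can in principle be pulled in as a library dependency; the public API itself is not documented here. A Cargo.toml entry would look like:

[dependencies]
lmonade = "0.1.0-alpha.2"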

Contributing

We welcome contributions! See CONTRIBUTING.md for guidelines.

Key areas for contribution:

  • Model implementations
  • Performance optimizations
  • Documentation improvements
  • Test coverage
  • Internationalization

License

GPL v3.0 - See LICENSE for details.

Status: Beta - Core features working, optimizations ongoing
