blazr

Crates.io: blazr
lib.rs: blazr
version: 0.1.0-beta.1
created_at: 2025-12-05 10:30:59.236553+00
updated_at: 2025-12-05 10:30:59.236553+00
description: Blazing-fast inference server for oxidizr models (Mamba2 + MLA + MoE)
homepage: https://github.com/farhan-syah/blazr
repository: https://github.com/farhan-syah/blazr
id: 1967989
size: 338,957
owner: Farhan Syah (farhan-syah)

README

blazr


A blazing-fast inference server for hybrid neural architectures, supporting Mamba2 SSM, Multi-Head Latent Attention (MLA), Mixture of Experts (MoE), and standard transformers.

Features

  • Auto-detection - Automatically detects model architecture from tensor names (no manual configuration required)
  • Hybrid Architecture Support - Seamlessly handles mixed Mamba2 and attention layers in a single model
  • OpenAI-Compatible API - Drop-in replacement with /v1/completions and /v1/chat/completions endpoints
  • High Performance - Written in Rust using the Candle ML framework with optional CUDA acceleration
  • Flexible - Supports custom oxidizr-trained models and standard Llama-style transformers

Quick Start

Installation

# Clone the repository
git clone https://github.com/farhan-syah/blazr.git
cd blazr

# Build (CPU-only)
cargo build --release

# Build with CUDA support (requires CUDA 12.x)
cargo build --release --features cuda
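
To put a blazr binary on your PATH instead of running it out of target/release, a standard cargo install from the checkout should also work (a minimal sketch of the usual Cargo workflow, not an install path documented by this project):

# Optional: install the binary from the local checkout (builds in release mode)
cargo install --path .

# Or with CUDA enabled
cargo install --path . --features cuda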

Basic Usage

Generate Text

blazr generate \
  --model ./checkpoints/nano \
  --prompt "Once upon a time" \
  --max-tokens 100

Start Server

blazr serve --model ./checkpoints/nano --port 8080

Then make API requests:

curl http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Hello, world!",
    "max_tokens": 50,
    "temperature": 0.7
  }'
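
The chat endpoint listed in the features should accept a request along these lines; this sketch assumes the usual OpenAI messages schema and has not been checked against this server's exact request format:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "Write a haiku about Rust."}
    ],
    "max_tokens": 50,
    "temperature": 0.7
  }'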

Model Info

blazr info --model ./checkpoints/nano

Supported Architectures

blazr auto-detects and supports:

  • Mamba2 - Selective state space model (SSM) layers
  • MLA - Multi-Head Latent Attention with compressed KV cache
  • MoE - Mixture of Experts with top-k routing and optional shared expert
  • Standard Transformers - GQA (Grouped Query Attention) with MLP layers

Models can mix and match these layer types freely.

Documentation

CLI Commands

# Generate text from a prompt
blazr generate --model <path> --prompt "text" [OPTIONS]

# Start inference server
blazr serve --model <path> [--port 8080] [--host 0.0.0.0]

# Display model configuration
blazr info --model <path>

Options

  • --max-tokens - Maximum tokens to generate (default: 100)
  • --temperature - Sampling temperature (default: 0.7)
  • --top-p - Nucleus sampling threshold (default: 0.9)
  • --top-k - Top-k sampling (default: 40)
  • --cpu - Force CPU inference even if CUDA is available
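
For example, the sampling options can be combined on a single generate call (the flags are the ones listed above; the prompt, path, and values are illustrative):

blazr generate \
  --model ./checkpoints/nano \
  --prompt "The capital of France is" \
  --max-tokens 50 \
  --temperature 0.2 \
  --top-p 0.9 \
  --top-k 40 \
  --cpu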

Model Format

blazr loads models from SafeTensors checkpoints:

checkpoint_dir/
├── model.safetensors    # Model weights
└── config.json          # Model configuration (optional)

If config.json is missing, blazr will auto-detect the architecture from tensor names.
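
As a quick sanity check, you can lay out a checkpoint directory by hand and ask blazr what it detects (paths are illustrative; only model.safetensors is strictly required per the layout above):

mkdir -p ./checkpoints/my-model
cp /path/to/model.safetensors ./checkpoints/my-model/

# Prints the (auto-detected) model configuration
blazr info --model ./checkpoints/my-model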

Requirements

  • Rust 1.70 or later
  • (Optional) CUDA 12.x for GPU acceleration

License

Apache-2.0 License - see LICENSE for details.

Related Projects

  • oxidizr - Training framework for hybrid Mamba2 + MLA + MoE architectures
  • splintr - High-performance BPE tokenizer with Python bindings

Contributing

Contributions are welcome! Please open an issue or submit a pull request.
