blazr

Crates.io: blazr
lib.rs: blazr
version: 0.1.0-beta.1
created_at: 2025-12-05 10:30:59.236553+00
updated_at: 2025-12-05 10:30:59.236553+00
description: Blazing-fast inference server for oxidizr models (Mamba2 + MLA + MoE)
homepage: https://github.com/farhan-syah/blazr
repository: https://github.com/farhan-syah/blazr
id: 1967989
size: 338,957
owner: Farhan Syah (farhan-syah)

README

blazr


A blazing-fast inference server for hybrid neural architectures, supporting Mamba2 SSM, Multi-Head Latent Attention (MLA), Mixture of Experts (MoE), and standard transformers.

Features

  • Auto-detection - Automatically detects model architecture from tensor names (no manual configuration required)
  • Hybrid Architecture Support - Seamlessly handles mixed Mamba2 and attention layers in a single model
  • OpenAI-Compatible API - Drop-in replacement with /v1/completions and /v1/chat/completions endpoints
  • High Performance - Written in Rust using the Candle ML framework with optional CUDA acceleration
  • Flexible - Supports custom oxidizr-trained models and standard Llama-style transformers

Quick Start

Installation

# Clone the repository
git clone https://github.com/farhan-syah/blazr.git
cd blazr

# Build (CPU-only)
cargo build --release

# Build with CUDA support (requires CUDA 12.x)
cargo build --release --features cuda
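
To put a blazr binary on your PATH instead of running it out of target/release, a standard cargo install from the checkout should also work (a minimal sketch of the usual Cargo workflow, not an install path documented by this project):

# Optional: install the binary from the local checkout (builds in release mode)
cargo install --path .

# Or with CUDA enabled
cargo install --path . --features cuda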

Basic Usage

Generate Text

blazr generate \
  --model ./checkpoints/nano \
  --prompt "Once upon a time" \
  --max-tokens 100

Start Server

blazr serve --model ./checkpoints/nano --port 8080

Then make API requests:

curl http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Hello, world!",
    "max_tokens": 50,
    "temperature": 0.7
  }'
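
The chat endpoint listed in the features should accept a request along these lines; this sketch assumes the usual OpenAI messages schema and has not been checked against this server's exact request format:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "Write a haiku about Rust."}
    ],
    "max_tokens": 50,
    "temperature": 0.7
  }'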

Model Info

blazr info --model ./checkpoints/nano

Supported Architectures

blazr auto-detects and supports:

  • Mamba2 - Selective state space model (SSM) layers
  • MLA - Multi-Head Latent Attention with compressed KV cache
  • MoE - Mixture of Experts with top-k routing and optional shared expert
  • Standard Transformers - GQA (Grouped Query Attention) with MLP layers

Models can mix and match these layer types freely.

Documentation

CLI Commands

# Generate text from a prompt
blazr generate --model <path> --prompt "text" [OPTIONS]

# Start inference server
blazr serve --model <path> [--port 8080] [--host 0.0.0.0]

# Display model configuration
blazr info --model <path>

Options

  • --max-tokens - Maximum tokens to generate (default: 100)
  • --temperature - Sampling temperature (default: 0.7)
  • --top-p - Nucleus sampling threshold (default: 0.9)
  • --top-k - Top-k sampling (default: 40)
  • --cpu - Force CPU inference even if CUDA is available
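
For example, the sampling options can be combined on a single generate call (the flags are the ones listed above; the prompt, path, and values are illustrative):

blazr generate \
  --model ./checkpoints/nano \
  --prompt "The capital of France is" \
  --max-tokens 50 \
  --temperature 0.2 \
  --top-p 0.9 \
  --top-k 40 \
  --cpu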

Model Format

blazr loads models from SafeTensors checkpoints:

checkpoint_dir/
├── model.safetensors    # Model weights
└── config.json          # Model configuration (optional)

If config.json is missing, blazr will auto-detect the architecture from tensor names.
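
As a quick sanity check, you can lay out a checkpoint directory by hand and ask blazr what it detects (paths are illustrative; only model.safetensors is strictly required per the layout above):

mkdir -p ./checkpoints/my-model
cp /path/to/model.safetensors ./checkpoints/my-model/

# Prints the (auto-detected) model configuration
blazr info --model ./checkpoints/my-model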

Requirements

  • Rust 1.70 or later
  • (Optional) CUDA 12.x for GPU acceleration

License

Apache-2.0 License - see LICENSE for details.

Related Projects

  • oxidizr - Training framework for hybrid Mamba2 + MLA + MoE architectures
  • splintr - High-performance BPE tokenizer with Python bindings

Contributing

Contributions are welcome! Please open an issue or submit a pull request.
