bitmamba

Crates.io: bitmamba
lib.rs: bitmamba
version: 0.1.0
created_at: 2026-01-09 00:24:52.792616+00
updated_at: 2026-01-09 00:24:52.792616+00
description: BitMamba: 1.58-bit Mamba language model with infinite context window - includes OpenAI-compatible API server
homepage: https://github.com/rileyseaburg/bitmamba
repository: https://github.com/rileyseaburg/bitmamba
max_upload_size:
id: 2031337
size: 116,341
Riley Seaburg (rileyseaburg)

documentation

https://docs.rs/bitmamba

README

BitMamba

A 1.58-bit Mamba language model with infinite context window, implemented in Rust.

Features

  • Infinite Context Window - Mamba's SSM maintains fixed-size state regardless of sequence length
  • 1.58-bit Weights - BitNet-style quantization for efficient inference
  • CPU Inference - No GPU required
  • OpenAI-Compatible API - Drop-in replacement for OpenAI API, works with Cline, Continue, etc.
  • Streaming Support - Server-Sent Events for real-time token generation

Installation

cargo install bitmamba

Or build from source:

git clone https://github.com/rileyseaburg/bitmamba
cd bitmamba
cargo build --release

Usage

CLI

# Run inference directly
bitmamba

OpenAI-Compatible Server

# Start the server
bitmamba-server

The server runs at http://localhost:8000 with these endpoints:

Endpoint              Method  Description
/v1/models            GET     List available models
/v1/chat/completions  POST    Chat completions (streaming supported)
/v1/completions       POST    Text completions
/health               GET     Health check
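
Any OpenAI-compatible client can talk to these endpoints. As a minimal sketch (not part of the crate; it assumes reqwest with the "blocking" and "json" features plus serde_json as dependencies), a chat-completion request from Rust looks like this:

use serde_json::json;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Standard OpenAI-style chat-completion payload; set "stream": true
    // to receive Server-Sent Events instead of a single JSON response.
    let body = json!({
        "model": "bitmamba-student",
        "messages": [{ "role": "user", "content": "Write a haiku about Rust." }]
    });

    let resp: serde_json::Value = reqwest::blocking::Client::new()
        .post("http://localhost:8000/v1/chat/completions")
        .json(&body)
        .send()?
        .json()?;

    // In the OpenAI-compatible response envelope, the reply text lives
    // at choices[0].message.content.
    println!("{}", resp["choices"][0]["message"]["content"]);
    Ok(())
}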

Configure with Cline/Continue

{
  "apiProvider": "openai-compatible",
  "baseUrl": "http://localhost:8000/v1",
  "model": "bitmamba-student"
}

As a Library

fn main() -> anyhow::Result<()> {
    // Load the default model weights and tokenizer.
    let (model, tokenizer) = bitmamba::load()?;

    let prompt = "def fibonacci(n):";
    let tokens = tokenizer.encode(prompt, true)?;
    // Generate up to 50 tokens at temperature 0.7.
    let output = model.generate(tokens.get_ids(), 50, 0.7)?;

    println!("{}", tokenizer.decode(&output, true)?);
    Ok(())
}

Model

The default model is rileyseaburg/bitmamba-student on Hugging Face, a 278M parameter BitMamba model distilled from Qwen2.5-Coder-1.5B.

Architecture

  • Hidden Size: 768
  • Layers: 12 BitMamba blocks
  • State Size: 16 (SSM state dimension)
  • Expand Factor: 2
  • Vocab Size: 151,665 (Qwen tokenizer)
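
For illustration only, these hyperparameters map onto a plain config struct along the following lines (the field names are hypothetical, not bitmamba's actual API):

// Hypothetical hyperparameter struct mirroring the numbers above;
// field names are illustrative, not the crate's real types.
struct BitMambaConfig {
    hidden_size: usize, // 768
    n_layers: usize,    // 12 BitMamba blocks
    state_size: usize,  // 16 (SSM state dimension)
    expand: usize,      // 2 (inner dim = expand * hidden_size = 1536)
    vocab_size: usize,  // 151_665 (Qwen tokenizer)
}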

BitMamba Block

Input -> RMSNorm -> BitLinear (in_proj) -> Conv1d -> SiLU -> SSM Scan -> Gate -> BitLinear (out_proj) -> Residual
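
The BitLinear projections carry ternary weights in {-1, 0, +1}, which is where the 1.58 bits come from (log2(3) ≈ 1.58). Below is a sketch of BitNet b1.58-style absmean quantization; the crate's actual implementation may differ in detail:

// BitNet b1.58-style "absmean" quantization: scale by the mean absolute
// weight, then round each entry to the nearest of {-1, 0, +1}.
// Illustrative only; bitmamba's internal layout may differ.
fn quantize_ternary(weights: &[f32]) -> (Vec<i8>, f32) {
    let n = weights.len().max(1) as f32;
    let scale = (weights.iter().map(|w| w.abs()).sum::<f32>() / n).max(1e-8);
    let q = weights
        .iter()
        .map(|w| (w / scale).round().clamp(-1.0, 1.0) as i8)
        .collect();
    (q, scale) // dequantize entry i as q[i] as f32 * scale
}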

The SSM Scan is the key component that enables infinite context:

// Fixed-size state, O(1) memory per token
h = dA * h + dB * x  // State update
y = h @ C + D * x     // Output
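
In Rust, a scalar version of that recurrence looks roughly like the sketch below (single-channel and simplified; the real scan is vectorized over the hidden dimension, and in Mamba dA and dB are input-dependent):

// Simplified selective-scan sketch: the recurrent state `h` has a fixed
// length (state_size), so memory stays O(1) per token regardless of how
// long the input sequence grows.
fn ssm_scan(xs: &[f32], d_a: &[f32], d_b: &[f32], c: &[f32], d: f32) -> Vec<f32> {
    let mut h = vec![0.0f32; d_a.len()]; // fixed-size state
    xs.iter()
        .map(|&x| {
            // h = dA * h + dB * x   (state update)
            for i in 0..h.len() {
                h[i] = d_a[i] * h[i] + d_b[i] * x;
            }
            // y = h . C + D * x     (output)
            h.iter().zip(c).map(|(hi, ci)| hi * ci).sum::<f32>() + d * x
        })
        .collect()
}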

Performance

Metric               Value
Parameters           278M
Memory (inference)   ~1.1 GB
Context Window       Unlimited
Quantization         1.58-bit weights

Citation

If you use BitMamba in your research, please cite:

@software{bitmamba2024,
  author = {Seaburg, Riley},
  title = {BitMamba: 1.58-bit Mamba with Infinite Context},
  year = {2024},
  url = {https://github.com/rileyseaburg/bitmamba}
}

Related Work

  • Mamba: Linear-Time Sequence Modeling with Selective State Spaces (Gu & Dao, 2023)
  • The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits (Ma et al., 2024), which introduced the BitNet b1.58 quantization scheme

License

MIT License - see LICENSE for details.
