| Crates.io | mullama |
| lib.rs | mullama |
| version | 0.1.1 |
| created_at | 2025-12-09 22:48:47.751359+00 |
| updated_at | 2026-01-17 01:59:28.151477+00 |
| description | Comprehensive Rust bindings for llama.cpp with memory-safe API and advanced features |
| homepage | https://github.com/neul-labs/mullama |
| repository | https://github.com/neul-labs/mullama |
| max_upload_size | |
| id | 1976889 |
| size | 1,880,399 |
Comprehensive Rust bindings for llama.cpp with advanced integration features
Mullama provides memory-safe Rust bindings for llama.cpp with production-ready features including async/await support, real-time streaming, multimodal processing, and web framework integration.
Most llama.cpp Rust bindings expose low-level C APIs directly. Mullama provides an idiomatic Rust experience:
// Other wrappers: manual memory management, raw pointers, verbose setup
let params = llama_context_default_params();
let ctx = unsafe { llama_new_context_with_model(model, params) };
let tokens = unsafe { llama_tokenize(model, text.as_ptr(), ...) };
// Don't forget to free everything...
// Mullama: builder patterns, async/await, automatic resource management
let model = ModelBuilder::new()
.path("model.gguf")
.gpu_layers(35)
.build().await?;
let response = model.generate("Hello", 100).await?;
Developer experience improvements:
| Feature | Other Wrappers | Mullama |
|---|---|---|
| API Style | Raw FFI / C-like | Builder patterns, fluent API |
| Async Support | Manual threading | Native async/await with Tokio |
| Error Handling | Error codes / panics | Result<T, MullamaError> with context |
| Memory Management | Manual free/cleanup | Automatic RAII |
| Streaming | Callbacks | Stream trait, async iterators |
| Configuration | Struct fields | Type-safe builders with validation |
| Web Integration | DIY | Built-in Axum routes |
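For example, streaming tokens as they are generated. This is a minimal sketch: the generate_stream method name and its return type are assumptions here, not the confirmed API; see the Features Guide for the exact streaming interface.
use futures_util::StreamExt; // assumed available as a client-side dependency
use mullama::prelude::*;
#[tokio::main]
async fn main() -> Result<(), MullamaError> {
    let model = ModelBuilder::new()
        .path("model.gguf")
        .build().await?;
    // Hypothetical streaming call: assumed to return an async Stream of token results.
    let mut stream = model.generate_stream("Tell me a story", 200).await?;
    while let Some(token) = stream.next().await {
        print!("{}", token?);
    }
    Ok(())
}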
[dependencies]
mullama = "0.1.1"
# With all features
mullama = { version = "0.1.1", features = ["full"] }
Linux (Ubuntu/Debian):
sudo apt install -y build-essential cmake pkg-config libasound2-dev libpulse-dev
macOS:
brew install cmake pkg-config portaudio
Windows: Install Visual Studio Build Tools and CMake.
See Platform Setup Guide for detailed instructions.
use mullama::prelude::*;
#[tokio::main]
async fn main() -> Result<(), MullamaError> {
let model = ModelBuilder::new()
.path("model.gguf")
.context_size(4096)
.build().await?;
let response = model.generate("The future of AI is", 100).await?;
println!("{}", response);
Ok(())
}
[dependencies.mullama]
version = "0.1.1"
features = [
"async", # Async/await support
"streaming", # Token streaming
"web", # Axum web framework
"websockets", # WebSocket support
"multimodal", # Image and audio processing
"streaming-audio", # Real-time audio capture
"format-conversion", # Audio/image format conversion
"parallel", # Rayon parallel processing
"late-interaction", # ColBERT-style multi-vector embeddings
"daemon", # Daemon mode with TUI client
"full" # All features
]
# Web applications
features = ["web", "websockets", "async", "streaming"]
# Multimodal AI
features = ["multimodal", "streaming-audio", "format-conversion"]
# High-performance batch processing
features = ["parallel", "async"]
# Semantic search / RAG with ColBERT-style retrieval
features = ["late-interaction", "parallel"]
# Daemon with TUI chat interface
features = ["daemon"]
Mullama includes a multi-model daemon with OpenAI-compatible HTTP API and TUI client:
# Build the CLI
cargo build --release --features daemon
# Start daemon with local model
mullama serve --model llama:./llama.gguf
# Start with HuggingFace model (auto-downloads and caches)
mullama serve --model hf:TheBloke/Llama-2-7B-GGUF
# Multiple models with custom aliases
mullama serve \
--model llama:hf:TheBloke/Llama-2-7B-GGUF:llama-2-7b.Q4_K_M.gguf \
--model mistral:hf:TheBloke/Mistral-7B-v0.1-GGUF
# Interactive TUI chat
mullama chat
# One-shot generation
mullama run "What is the meaning of life?"
# Model management
mullama models # List loaded models
mullama load phi:./phi.gguf # Load a model
mullama unload phi # Unload a model
mullama default llama # Set default model
# Search for models on HuggingFace
mullama search "llama 7b" # Search GGUF models
mullama search "mistral" --files # Show available files
mullama search "phi" --all # Include non-GGUF models
mullama info TheBloke/Llama-2-7B-GGUF # Show repo details
# Cache management
mullama pull hf:TheBloke/Llama-2-7B-GGUF # Pre-download model
mullama cache list # List cached models
mullama cache size # Show cache size
mullama cache clear # Clear cache
# Use OpenAI-compatible API
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "llama", "messages": [{"role": "user", "content": "Hello!"}]}'
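The same request can be made from Rust. A minimal client-side sketch, assuming reqwest (with the json feature), serde_json, and tokio as dependencies (none of these are required by Mullama itself):
use serde_json::json;
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // POST to the daemon's OpenAI-compatible chat completions endpoint.
    let body = json!({
        "model": "llama",
        "messages": [{ "role": "user", "content": "Hello!" }]
    });
    let resp: serde_json::Value = reqwest::Client::new()
        .post("http://localhost:8080/v1/chat/completions")
        .json(&body)
        .send().await?
        .json().await?;
    // OpenAI-style responses put the generated text under choices[0].message.content.
    println!("{}", resp["choices"][0]["message"]["content"]);
    Ok(())
}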
hf:<owner>/<repo>:<filename> # Specific file
hf:<owner>/<repo> # Auto-detect best GGUF
<alias>:hf:<owner>/<repo> # With custom alias
| Variable | Description |
|---|---|
| HF_TOKEN | HuggingFace token for gated/private models |
| MULLAMA_CACHE_DIR | Override default cache directory |
| Platform | Default Location |
|---|---|
| Linux | $XDG_CACHE_HOME/mullama/models or ~/.cache/mullama/models |
| macOS | ~/Library/Caches/mullama/models |
| Windows | %LOCALAPPDATA%\mullama\models |
Architecture:
┌──────────────────────────────────┐
│ Daemon │
┌─────────────┐ │ ┌────────────────────────────┐ │
│ TUI Client │◄── nng (IPC) ─────►│ │ Model Manager │ │
└─────────────┘ │ │ ┌───────┐ ┌───────┐ │ │
│ │ │Model 1│ │Model 2│ ... │ │
┌─────────────┐ │ │ └───────┘ └───────┘ │ │
│ curl/app │◄── HTTP/REST ─────►│ └────────────────────────────┘ │
└─────────────┘ (OpenAI API) │ │
│ Endpoints: │
┌─────────────┐ │ • /v1/chat/completions │
│ Other Client│◄── nng (IPC) ─────►│ • /v1/completions │
└─────────────┘ │ • /v1/models │
│ • /v1/embeddings │
└──────────────────────────────────┘
Programmatic usage:
use mullama::daemon::{DaemonClient, DaemonBuilder};
// Connect as client
let client = DaemonClient::connect_default()?;
let result = client.chat("Hello, AI!", None, 100, 0.7)?;
println!("{} ({:.1} tok/s)", result.text, result.tokens_per_second());
// List models
for model in client.list_models()? {
println!("{}: {}M params", model.alias, model.info.parameters / 1_000_000);
}
Mullama supports ColBERT-style late interaction retrieval with multi-vector embeddings. Unlike traditional embeddings that pool all tokens into a single vector, late interaction preserves per-token embeddings for fine-grained matching using MaxSim scoring.
use mullama::late_interaction::{
MultiVectorGenerator, MultiVectorConfig, LateInteractionScorer
};
use std::sync::Arc;
// Create generator (works with any embedding model)
let model = Arc::new(Model::load("model.gguf")?);
let config = MultiVectorConfig::default()
.normalize(true)
.skip_special_tokens(true);
let mut generator = MultiVectorGenerator::new(model, config)?;
// Generate multi-vector embeddings
let query = generator.embed_text("What is machine learning?")?;
let doc = generator.embed_text("Machine learning is a branch of AI...")?;
// Score with MaxSim
let score = LateInteractionScorer::max_sim(&query, &doc);
// Top-k retrieval
let documents: Vec<_> = texts.iter()
.map(|t| generator.embed_text(t))
.collect::<Result<Vec<_>, _>>()?;
let top_k = LateInteractionScorer::find_top_k(&query, &documents, 10);
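For intuition, MaxSim takes each query token vector, finds its maximum similarity against all document token vectors, and sums those maxima. A minimal standalone sketch over plain Vec<Vec<f32>> token embeddings (not Mullama's internal types):
/// MaxSim over L2-normalized token embeddings: for each query vector,
/// take the maximum dot product against all document vectors, then sum.
/// Assumes non-empty, equal-dimension, normalized vectors.
fn max_sim(query: &[Vec<f32>], doc: &[Vec<f32>]) -> f32 {
    query.iter()
        .map(|q| {
            doc.iter()
                .map(|d| q.iter().zip(d).map(|(a, b)| a * b).sum::<f32>())
                .fold(f32::NEG_INFINITY, f32::max)
        })
        .sum()
}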
With parallel processing:
// Enable both features: ["late-interaction", "parallel"]
let top_k = LateInteractionScorer::find_top_k_parallel(&query, &documents, 10);
let scores = LateInteractionScorer::batch_score_parallel(&queries, &documents);
Recommended models:
LiquidAI/LFM2-ColBERT-350M-GGUF - Purpose-trained ColBERT model
GPU acceleration backends are selected with environment variables:
# NVIDIA CUDA
export LLAMA_CUDA=1
# Apple Metal (macOS)
export LLAMA_METAL=1
# AMD ROCm (Linux)
export LLAMA_HIPBLAS=1
# Intel OpenCL
export LLAMA_CLBLAST=1
| Document | Description |
|---|---|
| Getting Started | Installation and first application |
| Platform Setup | OS-specific setup instructions |
| Features Guide | Integration features overview |
| Use Cases | Real-world application examples |
| API Reference | Complete API documentation |
| Sampling Guide | Sampling strategies and configuration |
| GPU Guide | GPU acceleration setup |
| Feature Status | Implementation status and roadmap |
# Basic text generation
cargo run --example simple --features async
# Streaming responses
cargo run --example streaming_generation --features "async,streaming"
# Web service
cargo run --example web_service --features "web,websockets"
# Audio processing
cargo run --example streaming_audio_demo --features "streaming-audio,multimodal"
# Late interaction / ColBERT retrieval
cargo run --example late_interaction --features late-interaction
cargo run --example late_interaction --features late-interaction -- model.gguf
We welcome contributions! See CONTRIBUTING.md for guidelines.
git clone --recurse-submodules https://github.com/neul-labs/mullama.git
cd mullama
cargo test --all-features
MIT License - see LICENSE for details.
Mullama tracks upstream llama.cpp releases:
| Mullama Version | llama.cpp Version | Release Date |
|---|---|---|
| 0.1.x | b7542 | Dec 2025 |
All architectures supported by llama.cpp b7542, including: