| Crates.io | gllm |
| lib.rs | gllm |
| version | 0.10.6 |
| created_at | 2025-11-28 17:01:44.016949+00 |
| updated_at | 2026-01-12 14:12:00.354562+00 |
| description | Pure Rust library for local embeddings, reranking, and text generation with MoE-optimized inference and aggressive performance tuning |
| homepage | https://github.com/putao520/gllm |
| repository | https://github.com/putao520/gllm |
| max_upload_size | |
| id | 1955736 |
| size | 879,530 |
gllm is a pure Rust library for local text embeddings, reranking, and text generation, built on the Burn deep learning framework. It provides an OpenAI SDK-style API with zero external C dependencies, supporting static compilation.
[dependencies]
gllm = "0.10"
| Feature | Default | Description |
|---|---|---|
| wgpu | Yes | GPU acceleration (Vulkan/DX12/Metal) |
| cpu | No | CPU-only inference (pure Rust) |
| tokio | No | Async interface support |
| wgpu-detect | No | GPU capabilities detection (VRAM, batch size) |
# CPU-only
gllm = { version = "0.10", features = ["cpu"] }
# With async
gllm = { version = "0.10", features = ["tokio"] }
# With GPU detection
gllm = { version = "0.10", features = ["wgpu-detect"] }
use gllm::Client;
fn main() -> Result<(), Box<dyn std::error::Error>> {
let client = Client::new("bge-small-en")?;
let response = client
.embeddings(["What is machine learning?", "Neural networks explained"])
.generate()?;
for emb in response.embeddings {
println!("Vector: {} dimensions", emb.embedding.len());
}
Ok(())
}
use gllm::Client;
fn main() -> Result<(), Box<dyn std::error::Error>> {
let client = Client::new("bge-reranker-v2")?;
let response = client
.rerank("What are renewable energy benefits?", [
"Solar power is clean and sustainable.",
"The stock market closed higher today.",
"Wind energy reduces carbon emissions.",
])
.top_n(2)
.return_documents(true)
.generate()?;
for result in response.results {
println!("Score: {:.4}", result.score);
}
Ok(())
}
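Because each result carries the original position of the document it scored (the Qwen3 reranker example further below prints result.index), the top-ranked passages can be mapped back onto the input. A minimal sketch under that assumption:
use gllm::Client;
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let docs = [
        "Solar power is clean and sustainable.",
        "The stock market closed higher today.",
        "Wind energy reduces carbon emissions.",
    ];
    let client = Client::new("bge-reranker-v2")?;
    let response = client
        .rerank("What are renewable energy benefits?", docs)
        .top_n(2)
        .generate()?;
    for result in response.results {
        // result.index is assumed to be the zero-based position of the document in `docs`.
        println!("{:.4}  {}", result.score, docs[result.index as usize]);
    }
    Ok(())
}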
[dependencies]
gllm = { version = "0.10", features = ["tokio"] }
tokio = { version = "1", features = ["rt-multi-thread", "macros"] }
use gllm::Client;
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
let client = Client::new("bge-small-en").await?;
let response = client
.embeddings(["Hello world"])
.generate()
.await?;
Ok(())
}
use gllm::{GpuCapabilities, GpuType};
// Detect GPU capabilities (cached after first call)
let caps = GpuCapabilities::detect();
println!("GPU: {} ({:?})", caps.name, caps.gpu_type);
println!("VRAM: {} MB", caps.vram_mb);
println!("Recommended batch size: {}", caps.recommended_batch_size);
if caps.gpu_available {
println!("Using {} backend", caps.backend_name);
}
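The detected capabilities can also drive batch sizing before embedding a large corpus. A minimal sketch, assuming the wgpu-detect feature is enabled and that embeddings() accepts any collection of string-like values, as the array arguments above suggest:
use gllm::{Client, GpuCapabilities};
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let caps = GpuCapabilities::detect();
    // Use the detected batch size on GPU; fall back to a conservative default on CPU.
    let batch_size = if caps.gpu_available {
        caps.recommended_batch_size as usize
    } else {
        8
    };

    let client = Client::new("bge-small-en")?;
    let texts: Vec<String> = (0..100).map(|i| format!("document {i}")).collect();

    // Embed the corpus in batches sized for the detected hardware.
    for chunk in texts.chunks(batch_size) {
        let batch: Vec<&str> = chunk.iter().map(String::as_str).collect();
        let response = client.embeddings(batch).generate()?;
        println!("embedded {} texts in this batch", response.embeddings.len());
    }
    Ok(())
}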
use gllm::FallbackEmbedder;
// Automatically falls back to CPU if GPU OOMs
let embedder = FallbackEmbedder::new("bge-small-en").await?;
let vector = embedder.embed("Hello world").await?;
CodeXEmbed models are optimized for code semantic similarity, outperforming Voyage-Code by 20%+ on the CoIR benchmark.
use gllm::Client;
fn main() -> Result<(), Box<dyn std::error::Error>> {
// CodeXEmbed-400M (1024 dimensions, BERT-based)
let client = Client::new("codexembed-400m")?;
let code_snippets = [
"fn add(a: i32, b: i32) -> i32 { a + b }",
"def add(a, b): return a + b",
"function add(a, b) { return a + b; }",
];
let response = client.embeddings(code_snippets).generate()?;
// All 3 add functions will have high similarity scores
for emb in response.embeddings {
println!("Vector: {} dimensions", emb.embedding.len());
}
Ok(())
}
For larger models with higher accuracy:
// CodeXEmbed-2B (1536 dimensions, Qwen2-based decoder)
let client = Client::new("codexembed-2b")?;
// CodeXEmbed-7B (4096 dimensions, Mistral-based decoder)
let client = Client::new("codexembed-7b")?;
The Qwen3 series provides state-of-the-art embeddings with a decoder architecture and quantization support.
use gllm::Client;
fn main() -> Result<(), Box<dyn std::error::Error>> {
// Qwen3 Embedding - decoder-based LLM for high-quality embeddings
let client = Client::new("qwen3-embedding-0.6b")?; // 1024 dimensions
// let client = Client::new("qwen3-embedding-4b")?; // 2560 dimensions
// let client = Client::new("qwen3-embedding-8b")?; // 4096 dimensions
let texts = [
"Rust is a systems programming language",
"Python is great for machine learning",
"JavaScript runs in browsers",
];
let response = client.embeddings(texts).generate()?;
for (i, emb) in response.embeddings.iter().enumerate() {
println!("Text {}: {} dimensions", i, emb.embedding.len());
}
Ok(())
}
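The three Qwen3 embedding sizes trade memory for accuracy, so one option is to pick a size from the detected VRAM. A minimal sketch, assuming the wgpu-detect feature is enabled; the VRAM thresholds below are illustrative guesses, not official requirements:
use gllm::{Client, GpuCapabilities};
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let caps = GpuCapabilities::detect();
    let vram_mb = caps.vram_mb as u64;

    // Illustrative thresholds only; adjust for your hardware and quantization choice.
    let model = if vram_mb >= 24_000 {
        "qwen3-embedding-8b"
    } else if vram_mb >= 12_000 {
        "qwen3-embedding-4b"
    } else {
        "qwen3-embedding-0.6b"
    };

    let client = Client::new(model)?;
    let response = client.embeddings(["pick the model that fits your GPU"]).generate()?;
    println!("{} -> {} dimensions", model, response.embeddings[0].embedding.len());
    Ok(())
}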
With quantization support for memory efficiency:
use gllm::registry;
// Quantized Qwen3 models (reduced memory, maintained quality)
let info = registry::resolve("qwen3-embedding-8b:int4")?; // Int4 quantization
let info = registry::resolve("qwen3-embedding-8b:int8")?; // Int8 quantization
let info = registry::resolve("qwen3-embedding-4b:awq")?; // AWQ quantization
High-accuracy document reranking with an LLM-based cross-encoder:
use gllm::Client;
fn main() -> Result<(), Box<dyn std::error::Error>> {
// Qwen3 Reranker - LLM-based cross-encoder
let client = Client::new("qwen3-reranker-0.6b")?;
// let client = Client::new("qwen3-reranker-4b")?;
// let client = Client::new("qwen3-reranker-8b")?;
let response = client
.rerank("What is the capital of France?", [
"Paris is the capital and largest city of France.",
"London is the capital of the United Kingdom.",
"The Eiffel Tower is located in Paris.",
])
.top_n(2)
.generate()?;
for result in response.results {
println!("Rank {}: Score {:.4}", result.index, result.score);
}
Ok(())
}
Generate text using decoder-based LLMs like Qwen2.5, GLM-4, and Mistral:
use gllm::Client;
fn main() -> Result<(), Box<dyn std::error::Error>> {
// Qwen2.5 Instruct models (latest 2025)
let client = Client::new("qwen2.5-7b-instruct")?;
// let client = Client::new("qwen2.5-0.5b-instruct")?; // Lightweight
// let client = Client::new("qwen2.5-72b-instruct")?; // Largest
// GLM-4 Chat models
// let client = Client::new("glm-4-9b-chat")?;
// Legacy Qwen2/Mistral
// let client = Client::new("qwen2-7b-instruct")?;
// let client = Client::new("mistral-7b-instruct")?;
let response = client
.generate("Explain quantum computing in simple terms:")
.max_tokens(256)
.temperature(0.7)
.top_p(0.9)
.generate()?;
println!("{}", response.text);
println!("Tokens: {}", response.tokens.len());
Ok(())
}
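The sampling parameters above control the determinism/creativity trade-off: lower temperature and top_p give focused, repeatable output, while higher values give more varied text. A minimal sketch comparing the two regimes with the same builder API (the parameter values are only illustrative, and the sketch assumes the client can be reused for multiple generations, as it is for embeddings in the semantic search example below):
use gllm::Client;
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = Client::new("qwen2.5-0.5b-instruct")?;
    let prompt = "Suggest a name for a Rust crate that does local embeddings:";

    // Near-deterministic: low temperature, tight nucleus.
    let focused = client
        .generate(prompt)
        .max_tokens(32)
        .temperature(0.2)
        .top_p(0.8)
        .generate()?;

    // More exploratory: higher temperature, wider nucleus.
    let creative = client
        .generate(prompt)
        .max_tokens(32)
        .temperature(1.0)
        .top_p(0.95)
        .generate()?;

    println!("focused:  {}", focused.text);
    println!("creative: {}", creative.text);
    Ok(())
}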
With streaming support (coming soon):
// Future API for streaming
let stream = client
.generate("Write a poem about Rust:")
.max_tokens(100)
.stream()?;
for token in stream {
print!("{}", token?);
}
| Model | Alias | Dimensions | Architecture | Best For |
|---|---|---|---|---|
| BGE Small EN | bge-small-en | 384 | Encoder | Fast English |
| BGE Base EN | bge-base-en | 768 | Encoder | Balanced English |
| BGE Large EN | bge-large-en | 1024 | Encoder | High accuracy |
| BGE Small ZH | bge-small-zh | 512 | Encoder | Chinese |
| E5 Small | e5-small | 384 | Encoder | Instruction tuned |
| E5 Base | e5-base | 768 | Encoder | Instruction tuned |
| E5 Large | e5-large | 1024 | Encoder | Instruction tuned |
| MiniLM L6 | all-MiniLM-L6-v2 | 384 | Encoder | General purpose |
| MiniLM L12 | all-MiniLM-L12-v2 | 384 | Encoder | General (larger) |
| MPNet Base | all-mpnet-base-v2 | 768 | Encoder | High quality |
| JINA v2 Base | jina-embeddings-v2-base-en | 768 | Encoder | Modern arch |
| JINA v2 Small | jina-embeddings-v2-small-en | 384 | Encoder | Lightweight |
| JINA v4 | jina-embeddings-v4 | 2048 | Encoder | Latest JINA |
| Qwen3 0.6B | qwen3-embedding-0.6b | 1024 | Decoder (Qwen3) | Lightweight |
| Qwen3 4B | qwen3-embedding-4b | 2560 | Decoder (Qwen3) | Balanced |
| Qwen3 8B | qwen3-embedding-8b | 4096 | Decoder (Qwen3) | High accuracy |
| Nemotron 8B | llama-embed-nemotron-8b | 4096 | Decoder (Llama) | State-of-the-art |
| M3E Base | m3e-base | 768 | Encoder | Chinese quality |
| Multilingual | multilingual-MiniLM-L12-v2 | 384 | Encoder | 50+ languages |
| Model | Alias | Dimensions | Architecture | Best For |
|---|---|---|---|---|
| CodeXEmbed 400M | codexembed-400m | 1024 | Encoder (BERT) | Fast code search |
| CodeXEmbed 2B | codexembed-2b | 1536 | Decoder (Qwen2) | Balanced code |
| CodeXEmbed 7B | codexembed-7b | 4096 | Decoder (Mistral) | High accuracy code |
| GraphCodeBERT | graphcodebert-base | 768 | Encoder | Legacy code |
CodeXEmbed (SFR-Embedding-Code) is the 2024 state of the art for code embedding, outperforming Voyage-Code by 20%+ on the CoIR benchmark.
| Model | Alias | Parameters | Architecture | Best For |
|---|---|---|---|---|
| Qwen3 Series (2025) | | | | |
| Qwen3 0.6B | qwen3-0.6b | 0.6B | Decoder (Qwen3) | Ultra-fast generation |
| Qwen3 1.7B | qwen3-1.7b | 1.7B | Decoder (Qwen3) | Lightweight |
| Qwen3 4B | qwen3-4b | 4B | Decoder (Qwen3) | Balanced |
| Qwen3 8B | qwen3-8b | 8B | Decoder (Qwen3) | High quality |
| Qwen3 14B | qwen3-14b | 14B | Decoder (Qwen3) | Very high quality |
| Qwen3 32B | qwen3-32b | 32B | Decoder (Qwen3) | Premium quality |
| Qwen2.5 Series | | | | |
| Qwen2.5 0.5B Instruct | qwen2.5-0.5b-instruct | 0.5B | Decoder (Qwen2) | Fast generation |
| Qwen2.5 1.5B Instruct | qwen2.5-1.5b-instruct | 1.5B | Decoder (Qwen2) | Lightweight |
| Qwen2.5 3B Instruct | qwen2.5-3b-instruct | 3B | Decoder (Qwen2) | Balanced |
| Qwen2.5 7B Instruct | qwen2.5-7b-instruct | 7B | Decoder (Qwen2) | High quality |
| Qwen2.5 14B Instruct | qwen2.5-14b-instruct | 14B | Decoder (Qwen2) | Very high quality |
| Qwen2.5 32B Instruct | qwen2.5-32b-instruct | 32B | Decoder (Qwen2) | Premium quality |
| Qwen2.5 72B Instruct | qwen2.5-72b-instruct | 72B | Decoder (Qwen2) | Maximum quality |
| Phi-4 Series (2025) | | | | |
| Phi-4 | phi-4 | 14B | Decoder (Phi3) | Microsoft flagship |
| Phi-4 Mini Instruct | phi-4-mini-instruct | 3.8B | Decoder (Phi3) | Efficient reasoning |
| Other 2025 Models | | | | |
| SmolLM3 3B | smollm3-3b | 3B | Decoder (SmolLM3) | HuggingFace efficient |
| InternLM3 8B Instruct | internlm3-8b-instruct | 8B | Decoder (InternLM3) | Chinese & English |
| GLM-4 9B Chat | glm-4-9b-chat | 9B | Decoder (GLM4) | Zhipu AI flagship |
| Legacy Models | | | | |
| Qwen2 7B Instruct | qwen2-7b-instruct | 7B | Decoder (Qwen2) | Legacy |
| Mistral 7B Instruct | mistral-7b-instruct | 7B | Decoder (Mistral) | Legacy |
Qwen3 (2025) is the latest state-of-the-art open-source LLM with 40K context and hybrid thinking modes. Phi-4 (2025) is Microsoft's flagship small model with exceptional reasoning capabilities. SmolLM3 and InternLM3 are efficient 2025 models optimized for edge deployment.
| Model | Alias | Total/Active Params | Experts | Best For |
|---|---|---|---|---|
| GLM-4.7 | glm-4.7 | 400B/40B | 160 (top-8) | Zhipu AI flagship MoE |
| Qwen3 30B-A3B | qwen3-30b-a3b | 30B/3B | MoE | Efficient large model |
| Qwen3 235B-A22B | qwen3-235b-a22b | 235B/22B | MoE | Maximum quality |
| Mixtral 8x7B Instruct | mixtral-8x7b-instruct | 47B/13B | 8 | Mistral flagship |
| Mixtral 8x22B Instruct | mixtral-8x22b-instruct | 176B/39B | 8 | Largest Mixtral |
| DeepSeek-V3 | deepseek-v3 | 671B/37B | 256 (top-8) | DeepSeek flagship |
The MoE architecture runs massive models efficiently by activating only a subset of experts per token. GLM-4.7 activates 8 of its 160 experts plus 1 shared expert per token, delivering 400B-parameter quality at roughly 40B active parameters of compute.
use gllm::Client;
fn main() -> Result<(), Box<dyn std::error::Error>> {
// GLM-4.7 MoE model (activates 8/160 experts per token)
let client = Client::new("glm-4.7")?;
let response = client
.generate("Explain mixture of experts architecture:")
.max_tokens(256)
.generate()?;
println!("{}", response.text);
Ok(())
}
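As a rough sanity check of the activation arithmetic above: top-8 routing plus the shared expert touches only a small fraction of the experts, and the active-parameter share quoted in the table is higher because the dense parts of the network (attention, embeddings) run for every token. The numbers below simply restate the figures from the table; they are illustrative, not measurements:
fn main() {
    // Figures quoted above for GLM-4.7.
    let total_experts = 160.0_f64;
    let active_experts = 8.0 + 1.0; // top-8 routed + 1 shared expert

    println!(
        "experts active per token: {:.1}%",
        active_experts / total_experts * 100.0
    ); // ~5.6%

    let total_params_b = 400.0_f64;
    let active_params_b = 40.0_f64;
    println!(
        "active parameter share: {:.0}%",
        active_params_b / total_params_b * 100.0
    ); // ~10%; the gap comes from dense layers that are always active
}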
| Model | Alias | Speed | Best For |
|---|---|---|---|
| BGE Reranker v2 | bge-reranker-v2 | Medium | Multilingual |
| BGE Reranker Large | bge-reranker-large | Slow | High accuracy |
| BGE Reranker Base | bge-reranker-base | Fast | Quick reranking |
| MS MARCO MiniLM L6 | ms-marco-MiniLM-L-6-v2 | Fast | Search |
| MS MARCO MiniLM L12 | ms-marco-MiniLM-L-12-v2 | Medium | Better search |
| MS MARCO TinyBERT | ms-marco-TinyBERT-L-2-v2 | Very Fast | Lightweight |
| Qwen3 Reranker 0.6B | qwen3-reranker-0.6b | Fast | Lightweight |
| Qwen3 Reranker 4B | qwen3-reranker-4b | Medium | Balanced |
| Qwen3 Reranker 8B | qwen3-reranker-8b | Slow | High accuracy |
| JINA Reranker v3 | jina-reranker-v3 | Medium | Latest JINA |
// Any HuggingFace SafeTensors model
let client = Client::new("sentence-transformers/all-MiniLM-L6-v2")?;
// Or use colon notation
let client = Client::new("sentence-transformers:all-MiniLM-L6-v2")?;
use gllm::ModelRegistry;
let registry = ModelRegistry::new();
// Use :suffix for quantized variants
let info = registry.resolve("qwen3-embedding-8b:int4")?; // Int4
let info = registry.resolve("qwen3-embedding-8b:awq")?; // AWQ
let info = registry.resolve("qwen3-reranker-4b:gptq")?; // GPTQ
Supported quantization types: :int4, :int8, :awq, :gptq, :gguf, :fp8, :bnb4, :bnb8
Models with quantization: Qwen3 Embedding/Reranker series, Nemotron 8B
use gllm::{Client, ClientConfig, Device};
let config = ClientConfig {
models_dir: "/custom/path".into(),
device: Device::Auto, // or Device::Cpu, Device::Gpu
};
let client = Client::with_config("bge-small-en", config)?;
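The device can also be picked from the detection API shown earlier instead of being hard-coded. A minimal sketch, assuming the wgpu-detect feature is enabled and that ClientConfig has exactly the two fields shown above:
use gllm::{Client, ClientConfig, Device, GpuCapabilities};
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let caps = GpuCapabilities::detect();
    let config = ClientConfig {
        models_dir: "/custom/path".into(),
        // Prefer the GPU backend when one was detected, otherwise stay on CPU.
        device: if caps.gpu_available { Device::Gpu } else { Device::Cpu },
    };
    let client = Client::with_config("bge-small-en", config)?;
    let response = client.embeddings(["hello"]).generate()?;
    println!("{} dimensions", response.embeddings[0].embedding.len());
    Ok(())
}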
// Semantic search: embed the query and the documents, then rank by cosine similarity.
let documents = [
    "Solar power is clean and sustainable.",
    "The stock market closed higher today.",
    "Wind energy reduces carbon emissions.",
];
let query_vec = client.embeddings(["search query"]).generate()?.embeddings[0].embedding.clone();
let doc_vecs = client.embeddings(documents).generate()?;
// Calculate cosine similarity between the query and each document
// (cosine_similarity is a small user-supplied helper; see the sketch below).
for (i, doc) in doc_vecs.embeddings.iter().enumerate() {
    let sim = cosine_similarity(&query_vec, &doc.embedding);
    println!("Doc {}: {:.4}", i, sim);
}
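The snippet above calls a cosine_similarity helper that is not part of gllm; a minimal sketch of such a helper, assuming the embedding vectors are f32:
/// Cosine similarity between two equal-length vectors.
/// Returns 0.0 when either vector has zero magnitude.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "vectors must have the same dimensionality");
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0
    } else {
        dot / (norm_a * norm_b)
    }
}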
Models are cached in ~/.gllm/models/:
~/.gllm/models/
├── BAAI--bge-small-en-v1.5/
│ ├── model.safetensors
│ ├── config.json
│ └── tokenizer.json
└── ...
| Backend | Device | Throughput (512-token texts) |
|---|---|---|
| WGPU | RTX 4090 | ~150 texts/sec |
| WGPU | Apple M2 | ~45 texts/sec |
| CPU | Intel i7-12700K | ~8 texts/sec |
cargo test --lib # Unit tests
cargo test --test integration # Integration tests
cargo test -- --ignored # E2E tests (downloads models)
MIT License - see LICENSE
Built with Rust 🦀