Crates.io | embellama |
lib.rs | embellama |
version | 0.4.0 |
created_at | 2025-09-03 21:27:26.90683+00 |
updated_at | 2025-09-25 19:10:15.38691+00 |
description | High-performance Rust library for generating text embeddings using llama-cpp |
homepage | |
repository | https://github.com/darjus/embellama |
max_upload_size | |
id | 1823298 |
size | 872,203 |
High-performance Rust library for generating text embeddings using llama-cpp.
use embellama::{EmbeddingEngine, EngineConfig};
// Create configuration
let config = EngineConfig::builder()
.with_model_path("/path/to/model.gguf")
.with_model_name("my-model")
.with_normalize_embeddings(true)
.build()?;
// Create engine (uses singleton pattern internally)
let engine = EmbeddingEngine::new(config)?;
// Generate single embedding
let text = "Hello, world!";
let embedding = engine.embed(None, text)?;
// Generate batch embeddings
let texts = vec!["Text 1", "Text 2", "Text 3"];
let embeddings = engine.embed_batch(None, texts)?;
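Embeddings come back as plain vectors, so downstream similarity math needs no extra dependencies. A minimal sketch, assuming embed() returns Vec<f32> (with with_normalize_embeddings(true) the vectors are already L2-normalized, so the dot product alone equals cosine similarity):
// Cosine similarity between two embedding vectors.
// Assumes engine.embed() returns Vec<f32>, as in the quick-start above.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (norm_a * norm_b)
}
let a = engine.embed(None, "The cat sat on the mat")?;
let b = engine.embed(None, "A feline rested on the rug")?;
println!("similarity: {:.3}", cosine_similarity(&a, &b));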
The engine can optionally use a singleton pattern for shared access across your application. The singleton methods return Arc<Mutex<EmbeddingEngine>> for thread-safe access:
// Get or initialize singleton instance (returns Arc<Mutex<EmbeddingEngine>>)
let engine = EmbeddingEngine::get_or_init(config)?;
// Access the singleton from anywhere in your application
let engine_clone = EmbeddingEngine::instance()
.expect("Engine not initialized");
// Use the engine (requires locking the mutex)
let embedding = {
let engine_guard = engine.lock().unwrap();
engine_guard.embed(None, "text")?
};
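The locking boilerplate can be wrapped in a small helper. A sketch built only from the calls shown above; the Vec<f32> return type and the embellama::Error error type are assumptions:
use std::sync::{Arc, Mutex};
use embellama::EmbeddingEngine;
// Hypothetical convenience wrapper around the singleton.
// Panics if the engine was never initialized or the mutex is poisoned.
fn embed_shared(text: &str) -> Result<Vec<f32>, embellama::Error> {
    let engine: Arc<Mutex<EmbeddingEngine>> =
        EmbeddingEngine::instance().expect("Engine not initialized");
    let guard = engine.lock().unwrap();
    guard.embed(None, text)
}
let embedding = embed_shared("some text")?;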
The library has been tested with a range of GGUF embedding models. Both BERT-style and LLaMA-style embedding models are supported.
Add this to your Cargo.toml:
[dependencies]
embellama = "0.4.0"
The library supports multiple backends for hardware acceleration. By default, it uses OpenMP for CPU parallelization. You can enable specific backends based on your hardware:
# Default - OpenMP CPU parallelization
embellama = "0.4.0"
# macOS Metal GPU acceleration
embellama = { version = "0.4.0", features = ["metal"] }
# NVIDIA CUDA GPU acceleration
embellama = { version = "0.4.0", features = ["cuda"] }
# Vulkan GPU acceleration (cross-platform)
embellama = { version = "0.4.0", features = ["vulkan"] }
# Native CPU optimizations
embellama = { version = "0.4.0", features = ["native"] }
# CPU-optimized build (native + OpenMP)
embellama = { version = "0.4.0", features = ["cpu-optimized"] }
Note: GPU backends (Metal, CUDA, Vulkan) are mutually exclusive; enable only one of them at a time.
A minimal configuration needs only the model path and a model name:
let config = EngineConfig::builder()
.with_model_path("/path/to/model.gguf")
.with_model_name("my-model")
.build()?;
The builder also exposes more advanced options:
let config = EngineConfig::builder()
.with_model_path("/path/to/model.gguf")
.with_model_name("my-model")
.with_context_size(2048) // Model context window (usize)
.with_n_threads(8) // CPU threads (usize)
.with_use_gpu(true) // Enable GPU acceleration
.with_n_gpu_layers(32) // Layers to offload to GPU (u32)
.with_batch_size(64) // Batch processing size (usize)
.with_normalize_embeddings(true) // L2 normalize embeddings
.with_pooling_strategy(PoolingStrategy::Mean) // Pooling method
.with_add_bos_token(Some(false)) // Disable BOS for encoder models (Option<bool>)
.build()?;
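Rather than hard-coding n_threads, it can be derived from the host at runtime; a sketch using std::thread::available_parallelism with the same builder calls as above:
use std::thread;
// Use as many threads as the OS reports, falling back to 4 if detection fails.
let n_threads = thread::available_parallelism()
    .map(|n| n.get())
    .unwrap_or(4);
let config = EngineConfig::builder()
    .with_model_path("/path/to/model.gguf")
    .with_model_name("my-model")
    .with_n_threads(n_threads)
    .build()?;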
The library can automatically detect and use the best available backend:
use embellama::{EngineConfig, detect_best_backend, BackendInfo};
// Automatic backend detection
let config = EngineConfig::with_backend_detection()
.with_model_path("/path/to/model.gguf")
.with_model_name("my-model")
.build()?;
// Check which backend was selected
let backend_info = BackendInfo::new();
println!("Using backend: {}", backend_info.backend);
println!("Available features: {:?}", backend_info.available_features);
The library automatically detects the model type and applies appropriate BOS token handling:
Encoder models (BERT, E5, BGE, GTE, MiniLM, etc.): the BOS token is disabled automatically.
Decoder models (LLaMA, Mistral, Vicuna, etc.): the BOS token is added automatically.
Manual override:
// Force disable BOS for a specific model
let config = EngineConfig::builder()
.with_model_path("/path/to/model.gguf")
.with_model_name("custom-encoder")
.with_add_bos_token(Some(false)) // Explicitly disable BOS
.build()?;
// Force enable BOS
let config = EngineConfig::builder()
.with_model_path("/path/to/model.gguf")
.with_model_name("custom-decoder")
.with_add_bos_token(Some(true)) // Explicitly enable BOS
.build()?;
// Auto-detect (default)
let config = EngineConfig::builder()
.with_model_path("/path/to/model.gguf")
.with_model_name("some-model")
.with_add_bos_token(None) // Let the library decide
.build()?;
⚠️ IMPORTANT: The LlamaContext from llama-cpp is !Send and !Sync, which means model contexts cannot be moved between threads or shared across threads via Arc alone. The library is designed with these constraints in mind: models are loaded per thread, and the underlying context remains !Send due to llama-cpp constraints.
Example of thread-safe usage with a regular (non-singleton) engine:
use std::thread;
// Each thread needs its own engine instance due to llama-cpp constraints
let handles: Vec<_> = (0..4)
.map(|i| {
let config = config.clone(); // Clone config for each thread
thread::spawn(move || {
// Create engine instance in each thread
let engine = EmbeddingEngine::new(config)?;
let text = format!("Thread {} text", i);
let embedding = engine.embed(None, &text)?;
Ok::<_, embellama::Error>(embedding)
})
})
.collect();
for handle in handles {
let embedding = handle.join().unwrap()?;
// Process embedding
}
Or using the singleton pattern for shared access:
use std::thread;
// Initialize singleton once
let engine = EmbeddingEngine::get_or_init(config)?;
let handles: Vec<_> = (0..4)
.map(|i| {
let engine = engine.clone(); // Clone Arc<Mutex<>>
thread::spawn(move || {
let text = format!("Thread {} text", i);
let embedding = {
let engine_guard = engine.lock().unwrap();
engine_guard.embed(None, &text)?
};
Ok::<_, embellama::Error>(embedding)
})
})
.collect();
for handle in handles {
let embedding = handle.join().unwrap()?;
// Process embedding
}
The library provides granular control over model lifecycle:
// Check if model is registered (has configuration)
if engine.is_model_registered("my-model") {
println!("Model configuration exists");
}
// Check if model is loaded in current thread
if engine.is_model_loaded_in_thread("my-model") {
println!("Model is ready to use in this thread");
}
// Deprecated - is_model_loaded() behaves the same as is_model_registered();
// prefer is_model_registered() for clarity
engine.is_model_loaded("my-model");
// Remove only from current thread (keeps registration)
engine.drop_model_from_thread("my-model")?;
// Model can be reloaded on next use
// Remove only from registry (prevents future loads)
engine.unregister_model("my-model")?;
// Existing thread-local instances continue working
// Full unload - removes from both registry and thread
engine.unload_model("my-model")?;
// Completely removes the model
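These calls compose into a simple hot-swap pattern: fully unload a model you no longer need, register a replacement, and let lazy loading pull it in on first use. A sketch reusing only the calls above; config2 and the model names are placeholders:
// Swap "old-model" out for the model described by config2 (e.g. "new-model").
engine.unload_model("old-model")?;   // remove from both registry and this thread
engine.load_model(config2)?;         // register the new model (lazy, not yet loaded)
let v = engine.embed(Some("new-model"), "first request loads the model")?;
// "new-model" is now loaded in this thread and ready for subsequent calls.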
Model loading behavior:
First model (via EmbeddingEngine::new()): loaded immediately in the current thread.
Additional models (via load_model()): lazy-loaded on first use.
// First model - loaded immediately
let engine = EmbeddingEngine::new(config)?;
assert!(engine.is_model_loaded_in_thread("model1"));
// Additional model - lazy loaded
engine.load_model(config2)?;
assert!(engine.is_model_registered("model2"));
assert!(!engine.is_model_loaded_in_thread("model2")); // Not yet loaded
// Triggers actual loading in thread
engine.embed(Some("model2"), "text")?;
assert!(engine.is_model_loaded_in_thread("model2")); // Now loaded
The library is optimized for high performance.
Run benchmarks with:
EMBELLAMA_BENCH_MODEL=/path/to/model.gguf cargo bench
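For a quick in-code sanity check (as opposed to the full cargo bench suite), a rough throughput sketch using std::time::Instant and the embed_batch() call shown earlier; it assumes embed_batch() returns one vector per input text, and uses a plain embed() call as warm-up:
use std::time::Instant;
// Warm up once so model load time is not counted in the measurement.
engine.embed(None, "warm-up")?;
let texts: Vec<String> = (0..256).map(|i| format!("document number {i}")).collect();
let refs: Vec<&str> = texts.iter().map(|s| s.as_str()).collect();
let start = Instant::now();
let embeddings = engine.embed_batch(None, refs)?;
let elapsed = start.elapsed();
println!(
    "embedded {} texts in {:.2?} ({:.1} texts/sec)",
    embeddings.len(),
    elapsed,
    embeddings.len() as f64 / elapsed.as_secs_f64()
);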
Performance tips:
Use embed_batch() for multiple texts.
Set n_threads based on the number of CPU cores.
Call warmup_model() before processing.
For development setup, testing, and contributing guidelines, please see DEVELOPMENT.md.
See the examples/ directory for more examples:
simple.rs - Basic embedding generation
batch.rs - Batch processing example
multi_model.rs - Using multiple models
config.rs - Configuration examples
error_handling.rs - Error handling patterns
Run examples with:
cargo run --example simple
Licensed under the Apache License, Version 2.0. See LICENSE for details.
Contributions are welcome! Please see DEVELOPMENT.md for development setup and contribution guidelines.
For issues and questions, please use the GitHub issue tracker.