| Crates.io | qlora-rs |
| lib.rs | qlora-rs |
| version | 1.0.1 |
| created_at | 2026-01-09 18:30:07.957081+00 |
| updated_at | 2026-01-25 03:01:58.448419+00 |
| description | 4-bit quantized LoRA (QLoRA) implementation with dual GGUF and Candle native export for Rust |
| homepage | https://github.com/tzervas/qlora-rs |
| repository | https://github.com/tzervas/qlora-rs |
| max_upload_size | |
| id | 2032579 |
| size | 229,740 |
4-bit quantized LoRA (QLoRA) implementation for Rust with dual GGUF and Candle native export.
qlora-rs provides efficient 4-bit quantization and QLoRA inference capabilities for Rust.
Status: 1.0.0 release. Core quantization, QLoRA inference, training support, and dual export formats are fully functional.
Add the crate to your `Cargo.toml`:

```toml
[dependencies]
qlora-rs = "1.0"
```
Basic NF4 quantization and dequantization:

```rust
use qlora_rs::{quantize_nf4, dequantize_nf4};
use candle_core::{Device, Tensor};

fn main() -> anyhow::Result<()> {
    let device = Device::Cpu;

    // Create some weights
    let weights = Tensor::randn(0.0, 1.0, (4096, 4096), &device)?;

    // Quantize to 4-bit NF4
    let quantized = quantize_nf4(&weights, 64)?; // block_size = 64

    println!("Original size: {} bytes", 4096 * 4096 * 4);
    println!("Quantized size: {} bytes", quantized.size_bytes());

    // Dequantize for computation
    let restored = dequantize_nf4(&quantized, &device)?;

    Ok(())
}
```
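As a rough sanity check on the sizes printed above, the NF4 footprint can be estimated by hand. The sketch below assumes one f32 absmax scale per 64-element block; the exact value reported by `size_bytes()` depends on the crate's storage layout.

```rust
fn main() {
    // Back-of-the-envelope size estimate for the 4096 x 4096 f32 tensor above.
    // Assumption: one f32 absmax scale per 64-element block.
    let (rows, cols, block_size) = (4096u64, 4096u64, 64u64);
    let fp32_bytes = rows * cols * 4;                 // 67,108,864 bytes (~64 MiB)
    let packed_bytes = rows * cols / 2;               // two 4-bit codes per byte
    let scale_bytes = (rows * cols / block_size) * 4; // one f32 scale per block
    println!("fp32: {fp32_bytes} B, nf4: ~{} B", packed_bytes + scale_bytes); // ~9.4 MB
}
```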
Building a QLoRA layer from existing weights:

```rust
use qlora_rs::{QLoraConfig, QuantizedLinear};
use candle_core::{Device, Tensor, DType};

fn main() -> anyhow::Result<()> {
    let device = Device::Cpu;
    let config = QLoraConfig::default();

    // Create layer from existing weights
    let weights = Tensor::randn(0.0, 1.0, (768, 768), &device)?;
    let layer = QuantizedLinear::from_weight(&weights, None, config, &device)?;

    // Forward pass
    let input = Tensor::zeros(&[1, 10, 768], DType::F32, &device)?;
    let output = layer.forward(&input)?;

    println!("Trainable parameters: {}", layer.num_trainable_parameters());
    Ok(())
}
```
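For context on the number printed by `num_trainable_parameters()`: a LoRA adapter of rank r on a d_in × d_out layer adds r·(d_in + d_out) trainable weights while the quantized base matrix stays frozen. The arithmetic below assumes a rank of 8 for the default `QLoraConfig`, which may not match the crate's actual default.

```rust
fn main() {
    // LoRA adds two small matrices, A (r x d_in) and B (d_out x r); only these
    // are trained. Assumed rank: 8 (hypothetical default) on the 768 x 768 layer.
    let (rank, d_in, d_out) = (8usize, 768usize, 768usize);
    let trainable = rank * (d_in + d_out); // 12,288
    let frozen = d_in * d_out;             // 589,824 (quantized, not trained)
    println!("trainable: {trainable} (~{:.1}% of the base weight)",
        100.0 * trainable as f64 / frozen as f64);
}
```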
Exporting quantized weights to GGUF:

```rust
use qlora_rs::{quantize_nf4, export_gguf};
use candle_core::{Device, Tensor};

fn main() -> anyhow::Result<()> {
    let device = Device::Cpu;

    // Quantize model weights
    let q_proj = Tensor::randn(0.0, 1.0, (4096, 4096), &device)?;
    let q_proj_quantized = quantize_nf4(&q_proj, 64)?;

    // Export to GGUF
    export_gguf(
        &[("model.layers.0.self_attn.q_proj.weight", &q_proj_quantized)],
        "model.gguf",
    )?;

    Ok(())
}
```
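Candle ships a GGUF reader in `candle_core::quantized::gguf_file`, which can be used to confirm the exported file is well-formed by listing its tensors. Whether the NF4 data maps onto a GGML quantization type that Candle can dequantize directly depends on qlora-rs's export format; the sketch below only inspects the container.

```rust
use candle_core::quantized::gguf_file;
use std::fs::File;

fn main() -> anyhow::Result<()> {
    // Read back the container written by export_gguf and list its tensors.
    let mut file = File::open("model.gguf")?;
    let content = gguf_file::Content::read(&mut file)?;
    for (name, info) in &content.tensor_infos {
        println!("{name}: shape {:?}, dtype {:?}", info.shape, info.ggml_dtype);
    }
    Ok(())
}
```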
NF4 (4-bit NormalFloat) uses 16 quantization levels optimized for normally-distributed data:
```text
-1.0, -0.696, -0.525, -0.395, -0.284, -0.185, -0.091, 0.0,
0.080, 0.161, 0.246, 0.338, 0.441, 0.563, 0.723, 1.0
```
This provides better accuracy than uniform quantization for neural network weights.
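Conceptually, each block of weights is scaled by its absolute maximum and every value is snapped to the nearest of the 16 levels above, so only a 4-bit index plus one scale per block needs to be stored. The sketch below illustrates that lookup; it is not qlora-rs's internal implementation, and the packing it implies is hypothetical.

```rust
/// Illustrative only: quantize one block of f32 values to NF4 level indices.
const NF4_LEVELS: [f32; 16] = [
    -1.0, -0.696, -0.525, -0.395, -0.284, -0.185, -0.091, 0.0,
    0.080, 0.161, 0.246, 0.338, 0.441, 0.563, 0.723, 1.0,
];

fn quantize_block(block: &[f32]) -> (f32, Vec<u8>) {
    // Per-block absmax scale, as in the QLoRA paper.
    let absmax = block.iter().fold(f32::EPSILON, |m, x| m.max(x.abs()));
    let codes = block
        .iter()
        .map(|&x| {
            let v = x / absmax; // normalize into [-1, 1]
            // Pick the index of the nearest NF4 level.
            (0usize..16)
                .min_by(|&a, &b| {
                    (v - NF4_LEVELS[a])
                        .abs()
                        .partial_cmp(&(v - NF4_LEVELS[b]).abs())
                        .unwrap()
                })
                .unwrap() as u8
        })
        .collect();
    (absmax, codes)
}

fn main() {
    let (scale, codes) = quantize_block(&[0.3, -0.7, 0.05, 1.2]);
    // Dequantized value for code i is NF4_LEVELS[i as usize] * scale.
    println!("scale = {scale}, codes = {codes:?}");
}
```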
Theoretical memory usage based on NF4 quantization (actual results may vary):
| Model Size | FP16 | NF4 (Expected) | Reduction |
|---|---|---|---|
| 7B params | 14GB | ~4GB | 3.5x |
| 13B params | 26GB | ~7GB | 3.7x |
| 70B params | 140GB | ~35GB | 4.0x |
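The "Expected" column follows from the bit widths: FP16 stores 2 bytes per parameter, NF4 stores half a byte plus a small per-block overhead for scales. A rough worked estimate for the 7B row, again assuming one f32 scale per 64-element block:

```rust
fn main() {
    // Rough estimate for a 7B-parameter model (the table rounds to whole GB).
    // Assumption: one f32 scale per 64-element block of weights.
    let params: f64 = 7e9;
    let fp16_gb = params * 2.0 / 1e9;                        // ~14 GB
    let nf4_gb = (params * 0.5 + params / 64.0 * 4.0) / 1e9; // ~3.9 GB
    println!("fp16: {fp16_gb:.1} GB, nf4: {nf4_gb:.1} GB, {:.1}x smaller", fp16_gb / nf4_gb);
}
```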
Unmaintained paste dependency: The paste crate (used by gemm → candle-core) is unmaintained (RUSTSEC-2024-0436). This is a transitive dependency and does not affect functionality.
Solution: A maintained fork, qlora-paste (v1.0.17), has been published to crates.io. The gemm-fork/ directory contains patched versions of the gemm crates that depend on qlora-paste instead of paste.
To use this in your project, add the following to your workspace root Cargo.toml:
```toml
[patch.crates-io]
gemm = { path = "qlora-rs/gemm-fork/gemm" }
gemm-common = { path = "qlora-rs/gemm-fork/gemm-common" }
gemm-f16 = { path = "qlora-rs/gemm-fork/gemm-f16" }
gemm-f32 = { path = "qlora-rs/gemm-fork/gemm-f32" }
gemm-f64 = { path = "qlora-rs/gemm-fork/gemm-f64" }
gemm-c32 = { path = "qlora-rs/gemm-fork/gemm-c32" }
gemm-c64 = { path = "qlora-rs/gemm-fork/gemm-c64" }
```
The security audit warning is currently ignored in CI as the crate remains functional. Future updates to Candle may resolve this.
See workspace AGENTS.md for coding conventions.
Dual licensed under MIT OR Apache-2.0 at your option.