| Crates.io | bitnet-quantize |
| lib.rs | bitnet-quantize |
| version | 0.1.1 |
| created_at | 2026-01-25 00:14:44.395007+00 |
| updated_at | 2026-01-25 03:01:17.204144+00 |
| description | Microsoft BitNet b1.58 quantization and inference for Rust |
| homepage | https://github.com/tzervas/bitnet-quantize |
| repository | https://github.com/tzervas/bitnet-quantize |
| max_upload_size | |
| id | 2067758 |
| size | 126,772 |
Microsoft BitNet b1.58 implementation in Rust with ternary weight quantization.
bitnet-quantize implements the BitNet b1.58 architecture for efficient neural network inference, centered on a BitLinear layer that can stand in for candle's nn::Linear.

Add to your Cargo.toml:
[dependencies]
bitnet-quantize = "0.1"
Or with optional features:

[dependencies]
bitnet-quantize = { version = "0.1", features = ["cuda", "peft", "gguf-export"] }
| Feature | Description |
|---|---|
| cuda | GPU acceleration via CubeCL |
| peft | peft-rs adapter integration |
| gguf-export | Export to GGUF format |
use bitnet_quantize::{BitLinear, BitNetConfig};
use candle_core::{Device, Tensor};
use candle_nn::Module;
fn main() -> Result<(), Box<dyn std::error::Error>> {
let device = Device::Cpu;
let config = BitNetConfig::default();
// Create layer from existing weights
let weight = Tensor::randn(0.0f32, 1.0, (512, 256), &device)?;
let layer = BitLinear::from_weight(&weight, None, &config)?;
// Forward pass
let input = Tensor::randn(0.0f32, 1.0, (4, 256), &device)?;
let output = layer.forward(&input)?;
println!("Input shape: {:?}", input.shape());
println!("Output shape: {:?}", output.shape());
println!("Compression ratio: {:.2}x", layer.compression_ratio());
println!("Weight sparsity: {:.1}%", layer.sparsity() * 100.0);
Ok(())
}
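In this example the weight follows candle's (out_features, in_features) layout, so the (512, 256) weight maps the (4, 256) input batch to a (4, 512) output; the reported sparsity depends on the random weights.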
Weights are quantized to ternary values:
W_q = round(W / mean(|W|)) clamped to {-1, 0, +1}
Activations are quantized to INT8 per-token:
X_q = round(X * 127 / max(|X|)) clamped to [-127, +127]
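As a rough, self-contained sketch of the two formulas above (illustrative function names, plain f32 slices rather than the crate's candle tensors):

// Illustrative only: ternary weight quantization, W_q = round(W / mean|W|) in {-1, 0, +1}.
fn quantize_weights_ternary(w: &[f32]) -> (Vec<i8>, f32) {
    let scale = w.iter().map(|x| x.abs()).sum::<f32>() / w.len() as f32;
    let q = w
        .iter()
        .map(|x| (x / scale).round().clamp(-1.0, 1.0) as i8)
        .collect();
    (q, scale) // keep the scale so W can be approximated as W_q * scale
}

// Illustrative only: per-token INT8 activation quantization,
// X_q = round(X * 127 / max|X|) clamped to [-127, +127].
fn quantize_activations_int8(x: &[f32]) -> (Vec<i8>, f32) {
    let max_abs = x.iter().fold(f32::MIN_POSITIVE, |m, v| m.max(v.abs()));
    let q = x
        .iter()
        .map(|v| (v * 127.0 / max_abs).round().clamp(-127.0, 127.0) as i8)
        .collect();
    (q, max_abs / 127.0) // X is approximated as X_q * scale
}

The table below summarizes the resulting storage savings: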
| Original | Quantized | Compression |
|---|---|---|
| FP32 (32 bits) | 2 bits/weight | 16x |
| FP16 (16 bits) | 2 bits/weight | 8x |
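As a concrete check: a 1024x4096 FP32 weight matrix occupies 1024 * 4096 * 4 bytes = 16 MiB; packed at 2 bits per weight it takes 1 MiB, a 16x reduction, plus a small amount of per-group scale metadata.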
use bitnet_quantize::BitNetConfig;
let config = BitNetConfig::builder()
.group_size(128) // Weights per scale group
.activation_bits(8) // INT8 activations
.per_token(true) // Per-token scaling
.use_ste(true) // Straight-Through Estimator
.build()?;
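The resulting config can then be passed to BitLinear::from_weight in place of the BitNetConfig::default() used in the example above.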
The Straight-Through Estimator (STE) lets gradients flow through the non-differentiable quantization step, so quantized layers can be trained directly:
use bitnet_quantize::layer::{ternary_ste, int8_ste};
// Forward: quantize to ternary
let quantized = ternary_ste(&weights)?;
// Backward: gradients pass through unchanged
// (handled automatically by Candle's autograd)
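Conceptually (this is a sketch of the standard STE formulation, not the crate's internal code), the trick is to compute y = x + detach(q(x) - x): the forward value equals q(x), but autograd treats the detached term as a constant, so dy/dx = 1 and gradients pass through the rounding step unchanged.

// Conceptual sketch only; names are illustrative.
fn ste_forward(x: f32, quantize: impl Fn(f32) -> f32) -> f32 {
    let detached = quantize(x) - x; // treated as a constant by the autograd engine
    x + detached                    // forward value equals quantize(x)
}

fn ste_backward(grad_output: f32) -> f32 {
    grad_output // identity gradient: the quantizer does not block or zero gradients
}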
Use BitNet as a PEFT adapter:
use bitnet_quantize::BitNetAdapter;
use peft_rs::Adapter;
let adapter = BitNetAdapter::new(config)?;
let adapted_weight = adapter.forward(&base_weight)?;
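This path uses the optional peft feature listed in the feature table above (the peft-rs adapter integration).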
Benchmarks on CPU (Intel i7):
| Layer Size | Forward Pass | Quantization |
|---|---|---|
| 256x512 | 0.8ms | 0.2ms |
| 512x1024 | 2.1ms | 0.5ms |
| 1024x4096 | 12ms | 2.1ms |
Run benchmarks:
cargo bench -p bitnet-quantize
Full API documentation: docs.rs/bitnet-quantize
MIT License - see LICENSE for details.