| Crates.io | rusty-gradients |
| lib.rs | rusty-gradients |
| version | 0.2.0 |
| created_at | 2025-08-29 21:37:19.968699+00 |
| updated_at | 2026-01-19 07:57:51.441454+00 |
| description | A full-stack deep learning framework in Rust for training and deploying Transformer models. Features multi-backend support (CPU/CUDA/Metal/WASM), 62x GPU acceleration, Safetensors serialization, and BPE tokenization. |
| homepage | https://github.com/Xzdes/RustyGradients |
| repository | https://github.com/Xzdes/RustyGradients |
| max_upload_size | |
| id | 1816942 |
| size | 552,552 |
A Production-Ready Deep Learning Framework in Rust
RustyGradients is a high-performance deep learning framework designed for production use, featuring multi-backend support, efficient serialization, and automatic differentiation.
Add to your Cargo.toml:
[dependencies]
rusty-gradients = "0.2"
# Or, with optional features enabled (don't repeat the key; replace the line above):
# rusty-gradients = { version = "0.2", features = ["cpu-blas", "serialization"] }
| Feature | Description | Performance Gain |
|---|---|---|
| `cpu` | Basic CPU backend with rayon | Baseline |
| `cpu-blas` | OpenBLAS acceleration | 10-50x faster matmul |
| `cuda` | CUDA backend (NEW!) 🚀 | 62x speedup (4,778 GFLOPS) |
| `serialization` | Safetensors + checkpoint management | 3.5x smaller, 7-9x faster I/O |
| `tokenization` | BPE + HuggingFace tokenizers (see the sketch below) | 6.74x better compression |
| `huggingface` | Load pre-trained models (GPT-2, LLaMA) | $0 vs $50k training cost |
| `metal-backend` | Metal backend for Apple Silicon (coming soon) | 20-50x speedup |
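The `tokenization` feature builds on the HuggingFace `tokenizers` crate. A minimal sketch drives that crate directly, since the exact re-export path inside rusty-gradients is an assumption here (`from_pretrained` requires the crate's `http` feature):

use tokenizers::Tokenizer;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Download and load GPT-2's pre-trained BPE tokenizer.
    let tokenizer = Tokenizer::from_pretrained("gpt2", None)?;
    let encoding = tokenizer.encode("Hello, RustyGradients!", false)?;
    println!("ids: {:?}", encoding.get_ids());
    Ok(())
}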
# Run the complete GPT training example
cargo run --example train_gpt_e2e --features "cpu serialization"
# With BLAS acceleration (10-50x faster)
cargo run --example train_gpt_e2e --features "cpu-blas serialization" --release
# With CUDA GPU acceleration (62x faster!) 🚀 NEW!
cargo run --example train_gpt_e2e --features "cuda serialization" --release
Output:
=== RustyGradients End-to-End Training Example ===
📖 Loading training data...
Text length: 1031 characters
🔤 Creating tokenizer...
Vocabulary size: 52
🏗️ Initializing model...
- Vocabulary: 52
- Embedding dim: 128
- Layers: 4
- Total weights: 11
⚙️ Backend: CPU
BLAS acceleration: ENABLED (OpenBLAS)
🚀 Starting training...
[ 10/ 80] 12.5% | Loss: 3.9955 | Speed: 160.29 steps/s
[ 20/ 80] 25.0% | Loss: 3.9855 | Speed: 159.33 steps/s
...
[ 80/ 80] 100.0% | Loss: 3.9255 | Speed: 153.34 steps/s
✅ Training complete!
Total time: 0.52s
Average loss: 3.9605
💾 Checkpoint saved: checkpoints/gpt_training/checkpoint_step_000080.safetensors
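Basic tensor operations with automatic differentiation: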
use rusty_gradients::tensor::Tensor;
use ndarray::ArrayD;
// Create tensors
let a = Tensor::new(ArrayD::ones(vec![3, 3]), true);
let b = Tensor::new(ArrayD::ones(vec![3, 3]) * 2.0, true);
// Operations
let c = a.add(&b); // Element-wise addition
let d = a.matmul(&b); // Matrix multiplication
let e = c.relu(); // ReLU activation
// Backward pass
e.backward();
println!("Gradient: {:?}", a.grad());
use rusty_gradients::nn::{Linear, Module, ReLU, Sequential};
use rusty_gradients::optim::{Adam, Optimizer};
use rusty_gradients::tensor::Tensor;
use rusty_gradients::losses::mse_loss;
fn main() -> Result<(), Box<dyn std::error::Error>> {
// Training data for XOR problem
let training_data = Tensor::new(
ndarray::array![[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]].into_dyn(),
false,
);
let training_labels = Tensor::new(
ndarray::array![[0.0], [1.0], [1.0], [0.0]].into_dyn(),
false,
);
// Create model
let model = Sequential::new(vec![
Box::new(Linear::new(2, 4)),
Box::new(ReLU::new()),
Box::new(Linear::new(4, 1)),
]);
// Create optimizer
let mut optimizer = Adam::new(model.parameters(), 0.01, None, None);
// Training loop
for epoch in 0..=1000 {
let predictions = model.forward(&training_data)?;
let loss = mse_loss(&predictions, &training_labels);
loss.backward();
optimizer.step();
optimizer.zero_grad();
if epoch % 100 == 0 {
println!("Epoch: {}, Loss: {:.4}", epoch, loss.data.borrow().sum());
}
}
Ok(())
}
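After training, the same `forward` call serves for inference; predictions for the four XOR inputs should approach [0, 1, 1, 0]:

// Inside main, after the training loop:
let predictions = model.forward(&training_data)?;
println!("Predictions: {:?}", predictions.data.borrow());

To save and restore model state during training, use the checkpoint manager: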
use rusty_gradients::serialization::{CheckpointManager, ModelMetadata};
// Create checkpoint manager
let manager = CheckpointManager::new("checkpoints", 3); // Keep last 3
// Save checkpoint (weights, weight_names, step, and loss come from your training loop)
let metadata = ModelMetadata {
model_type: "GPT".to_string(),
vocab_size: 50257,
embedding_dim: 768,
num_layers: 12,
num_heads: 12,
block_size: 1024,
dropout: 0.1,
};
manager.save_checkpoint(
&weights,
&weight_names,
&metadata,
step,
loss,
)?;
// Load best checkpoint
let (weights, shapes, names, metadata) = manager.load_best()?;
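A quick sketch of inspecting what `load_best` restored (assuming `names` is a `Vec<String>` matching `weight_names` above; the `metadata` fields are those of `ModelMetadata`):

println!(
    "Restored {} tensors for a {}-layer {} model",
    names.len(),
    metadata.num_layers,
    metadata.model_type
);

To run matrix math on the GPU directly, use the CUDA backend: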
use rusty_gradients::backend::{Backend, cuda::CudaBackend};
// Initialize CUDA backend
let backend = CudaBackend::new(0)?; // GPU 0
// Create matrices on GPU
let a = backend.from_slice(&[1.0, 2.0, 3.0, 4.0], &[2, 2])?;
let b = backend.from_slice(&[5.0, 6.0, 7.0, 8.0], &[2, 2])?;
// Matrix multiplication on GPU (62x faster!)
let c = backend.matmul(&a, &b)?;
backend.synchronize()?;
// Copy result back to CPU
let result = backend.to_vec(&c)?;
println!("Result: {:?}", result); // [19.0, 22.0, 43.0, 50.0]
Run CUDA demo:
cargo run --example cuda_demo --features cuda --release
cargo bench --bench cuda_comparison --features cuda
Expected performance for a 1024×1024 matmul: roughly 4,778 GFLOPS on the GPU versus the ~77 GFLOPS naive CPU baseline, i.e. the 62x speedup quoted above.
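For context: an n×n matmul costs about 2n³ floating-point operations, so a 1024×1024 multiply is 2 · 1024³ ≈ 2.15 GFLOP. At 4,778 GFLOPS that takes roughly 0.45 ms, versus about 28 ms at 77 GFLOPS, which is exactly where the 62x figure comes from (4778 / 77 ≈ 62).

Choosing a serialization format: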
use rusty_gradients::serialization::{json, safetensors_format};
// Legacy JSON (slow, large)
json::save_json("model.json", &weights, &metadata, step, loss)?;
// Safetensors (3.5x smaller, 7-9x faster)
safetensors_format::save_model("model.safetensors", &weights, &names, &metadata)?;
Performance Comparison:
| Format | File Size | Save Time | Load Time |
|---|---|---|---|
| JSON | 675 MB | 3.40s | 1.83s |
| Safetensors | 193 MB | 0.46s | 0.22s |
| Improvement | 3.5x smaller | 7.4x faster | 8.3x faster |
cargo bench --bench blas_comparison
| Configuration | GFLOPS | vs Baseline |
|---|---|---|
| Naive (no BLAS) | 77 | 1x |
| OpenBLAS | 500+ | 6-10x |
| cuBLAS (CUDA) | 1500+ | 20-30x |
cargo bench --bench simd_benchmark
| Operation | Throughput | Speedup |
|---|---|---|
| ReLU | 1.0 GElements/s | 2-4x |
| Exp | 0.7 GElements/s | 2-4x |
| Sigmoid | 0.8 GElements/s | 2-4x |
cargo bench --bench layernorm_benchmark
| Method | Throughput | Memory Passes |
|---|---|---|
| Standard | 0.15 GElements/s | 2 passes |
| Fused (Welford) | 0.38 GElements/s | 1 pass |
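The fused variant computes mean and variance in a single pass with Welford's online algorithm, instead of one pass for the mean and a second for the variance. A minimal sketch of that update rule (not the library's internal code):

/// One-pass mean and (population) variance via Welford's algorithm.
fn welford(xs: &[f32]) -> (f32, f32) {
    let (mut mean, mut m2) = (0.0f32, 0.0f32);
    for (i, &x) in xs.iter().enumerate() {
        let delta = x - mean;
        mean += delta / (i as f32 + 1.0);
        m2 += delta * (x - mean); // note: uses the already-updated mean
    }
    (mean, m2 / xs.len() as f32)
}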
Selecting a compute device:
use rusty_gradients::backend::Device;
// CPU backend
let device = Device::cpu();
let tensor = TensorV2::new_cpu(data, requires_grad);
// CUDA backend (requires the `cuda` feature)
#[cfg(feature = "cuda")]
let device = Device::cuda(0); // GPU 0
let tensor = tensor.to_device(&device);
A small helper for logging progress in the format used by the training output above:
use std::time::Instant;
struct ProgressTracker {
    total_steps: usize,
    current_step: usize,
    losses: Vec<f32>,
    start_time: Instant,
}
impl ProgressTracker {
    fn new(total_steps: usize) -> Self {
        Self {
            total_steps,
            current_step: 0,
            losses: Vec::new(),
            start_time: Instant::now(),
        }
    }
    fn update(&mut self, loss: f32) {
        self.current_step += 1;
        self.losses.push(loss);
        if self.current_step % 10 == 0 {
            // Average the last 10 losses and report throughput since start.
            let avg_loss = self.losses.iter().rev().take(10).sum::<f32>() / 10.0;
            let progress = (self.current_step as f32 / self.total_steps as f32) * 100.0;
            let speed = self.current_step as f32 / self.start_time.elapsed().as_secs_f32();
            println!(
                "[{:>6}/{:>6}] {:>5.1}% | Loss: {:.4} | Speed: {:.2} steps/s",
                self.current_step, self.total_steps, progress, avg_loss, speed
            );
        }
    }
}
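Hooking it into a loop:

let mut tracker = ProgressTracker::new(80);
for _ in 0..80 {
    let loss = 3.99; // stand-in for the loss returned by a real training step
    tracker.update(loss);
}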
RustyGradients can be compiled to WebAssembly for running neural networks in the browser.
# Install wasm-pack
cargo install wasm-pack
# Build WASM package
wasm-pack build --target web
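The `--target web` build emits an ES module under pkg/, which the browser code below imports directly.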
import init, { WasmGptTrainer, init_panic_hook } from './pkg/rusty_gradients.js';
async function run() {
// Initialize WASM module
await init();
init_panic_hook();
// Create trainer
const config = {
blockSize: 32,
vocabSize: 65,
numLayers: 4,
numHeads: 4,
embeddingDim: 64,
learningRate: 0.001
};
const trainer = new WasmGptTrainer(
config.blockSize,
config.vocabSize,
config.numLayers,
config.numHeads,
config.embeddingDim,
config.learningRate
);
// Train
const xBatch = new Uint32Array([10, 20, 30]);
const yBatch = new Uint32Array([20, 30, 31]);
const loss = trainer.train_step(xBatch, yBatch);
console.log(`Loss: ${loss}`);
// Generate
const prompt = new Uint32Array([1, 2, 3]);
const generated = trainer.generate(prompt, 100, 0.8, 10); // args likely: prompt, max new tokens, temperature, top-k
console.log("Generated:", generated);
}
run();
Contributions are welcome! Please feel free to submit a Pull Request.
# Clone repository
git clone https://github.com/Xzdes/RustyGradients.git
cd RustyGradients
# Run tests
cargo test
# Run benchmarks
cargo bench
# Build with all features
cargo build --release --all-features
See Roadmap for planned features. Open an issue for new ideas!
MIT License - see LICENSE for details
Made with ❤️ in Rust