unsloth-rs

Crate: unsloth-rs (crates.io)
Version: 1.0.1
Repository: https://github.com/tzervas/unsloth-rs
Author: Tyler Zervas (tzervas)
Description: Rust implementations of transformer building blocks for LLM inference
README

unsloth-rs

Rust implementations of transformer building blocks for LLM inference and fine-tuning.


Overview

unsloth-rs provides Rust implementations of common transformer operations built on the Candle ML framework:

  • Multi-head attention with grouped-query attention (GQA) support
  • Rotary position embeddings (RoPE)
  • RMS normalization
  • SwiGLU activation
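Of these, RMS normalization is the simplest to state: each vector is scaled by the reciprocal root-mean-square of its elements, then multiplied by a learned gain. A minimal sketch of the math on a plain slice (illustrative names, not the crate's Candle-based API):

```rust
/// RMS-normalize a vector: x_i * g_i / sqrt(mean(x^2) + eps).
/// Illustrative only; the crate operates on Candle tensors, not slices.
fn rms_norm(x: &[f32], gain: &[f32], eps: f32) -> Vec<f32> {
    let mean_sq = x.iter().map(|v| v * v).sum::<f32>() / x.len() as f32;
    let inv_rms = 1.0 / (mean_sq + eps).sqrt();
    x.iter().zip(gain).map(|(v, g)| v * inv_rms * g).collect()
}

fn main() {
    let y = rms_norm(&[1.0, 2.0, 3.0, 4.0], &[1.0; 4], 1e-6);
    println!("{:?}", y);
}
```

Unlike LayerNorm, no mean is subtracted, which is one reason it is cheap enough to fuse into surrounding kernels.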

Status

Version 1.0.1 - Core functionality is stable. Current kernels are CPU reference implementations; GPU execution dispatches through Candle's CUDA backend.

Implemented

  • ✅ Multi-head attention (CPU reference, Candle CUDA backend)
  • ✅ Rotary position embeddings (RoPE)
  • ✅ RMS normalization
  • ✅ SwiGLU activation
  • ✅ Memory estimation utilities
  • ✅ Ternary quantization (5-15x compression achieved)
  • ✅ Mixed precision training utilities (FP32/FP16/BF16)
  • ✅ Benchmarking suite (CPU)
  • ✅ 160 passing tests (100% pass rate)
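The 5-15x compression from ternary quantization comes from storing each weight as one of {-1, 0, +1} (about 2 bits, often less after packing) plus a per-tensor scale, instead of a 32-bit float. A hedged sketch of one common scheme (TWN-style 0.7 · mean|w| threshold; the crate's actual scheme may differ):

```rust
/// Ternarize weights to {-1, 0, +1} with a per-tensor scale.
/// The 0.7 * mean|w| threshold follows common ternary-weight practice;
/// this is a sketch, not necessarily the crate's implementation.
fn ternarize(w: &[f32]) -> (Vec<i8>, f32) {
    let mean_abs = w.iter().map(|v| v.abs()).sum::<f32>() / w.len() as f32;
    let t = 0.7 * mean_abs;
    let q: Vec<i8> = w
        .iter()
        .map(|&v| if v > t { 1 } else if v < -t { -1 } else { 0 })
        .collect();
    // Scale = mean magnitude of the weights that survived the threshold.
    let (sum, n) = w
        .iter()
        .filter(|v| v.abs() > t)
        .fold((0.0f32, 0usize), |(s, n), v| (s + v.abs(), n + 1));
    let scale = if n > 0 { sum / n as f32 } else { 0.0 };
    (q, scale)
}

fn main() {
    let (q, scale) = ternarize(&[0.9, -0.8, 0.05, -0.1]);
    println!("{:?} scale={}", q, scale);
}
```

Dequantization is then just q * scale, so matmuls reduce to additions and subtractions of the scale.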

In Progress

  • 🚧 Flash Attention CubeCL GPU kernel (Phase 1 complete, Phase 2 ready for RTX 5080 validation)
  • 🚧 Ternary GPU kernels (Phase 2-4 implemented, awaiting GPU profiling)
  • 🚧 CI/CD pipeline setup
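The core trick behind Flash Attention's tiling is the online softmax: scores are consumed in a single streaming pass while a running maximum and rescaled denominator are maintained, so the full score matrix never has to be materialized. A CPU sketch of that rescaling (illustrative only, not the CubeCL kernel):

```rust
/// Online softmax: one pass tracking a running max `m` and a denominator `d`
/// that is rescaled whenever the max grows. This is the numerically stable
/// streaming form that Flash Attention applies per tile.
fn online_softmax(scores: &[f32]) -> Vec<f32> {
    let mut m = f32::NEG_INFINITY; // running max
    let mut d = 0.0f32;            // running denominator
    for &s in scores {
        let m_new = m.max(s);
        // Rescale the old denominator to the new max, then add this term.
        d = d * (m - m_new).exp() + (s - m_new).exp();
        m = m_new;
    }
    scores.iter().map(|&s| (s - m).exp() / d).collect()
}

fn main() {
    println!("{:?}", online_softmax(&[1.0, 2.0, 3.0]));
}
```

Because only m and d cross tile boundaries, attention over a long sequence can be computed block by block in on-chip memory.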

Planned

  • ⏳ Gradient checkpointing (configuration exists, implementation planned)
  • ⏳ GPU performance validation on RTX 5080/3090 Ti
  • ⏳ RoPE, RMSNorm, SwiGLU GPU kernels
  • ⏳ Advanced sparsity optimizations
  • ⏳ Multi-GPU support

Installation

[dependencies]
unsloth-rs = "1.0.1"

For CUDA support (uses Candle's CUDA backend):

[dependencies]
unsloth-rs = { version = "1.0.1", features = ["cuda"] }

Usage

Attention

use unsloth_rs::kernels::{FusedAttention, FusedAttentionConfig};
use candle_core::{Device, Tensor};

fn main() -> anyhow::Result<()> {
    let device = Device::Cpu;
    
    let config = FusedAttentionConfig {
        hidden_size: 768,
        num_heads: 12,
        head_dim: 64,
        num_kv_heads: Some(4),  // GQA support
        ..Default::default()
    };
    
    let attention = FusedAttention::new(config, &device)?;
    
    // Random input of shape (batch=1, seq_len=128, hidden=768):
    // randn(mean, std_dev, shape, device)
    let hidden_states = Tensor::randn(0.0f32, 1.0, (1, 128, 768), &device)?;
    let _output = attention.forward(&hidden_states, None, None)?;
    
    Ok(())
}
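With num_heads: 12 and num_kv_heads: Some(4) as above, grouped-query attention shares each key/value head across 3 consecutive query heads, shrinking the KV cache by the same factor. The index mapping is simple enough to sketch (hypothetical helper, not part of the crate's API):

```rust
/// In grouped-query attention, consecutive query heads share one KV head.
/// With 12 query heads and 4 KV heads, each KV head serves 12 / 4 = 3
/// query heads. Hypothetical helper for illustration.
fn kv_head_for(q_head: usize, num_heads: usize, num_kv_heads: usize) -> usize {
    assert!(num_heads % num_kv_heads == 0, "query heads must divide evenly");
    q_head / (num_heads / num_kv_heads)
}

fn main() {
    for q in 0..12 {
        println!("query head {} -> kv head {}", q, kv_head_for(q, 12, 4));
    }
}
```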

Memory Estimation

use unsloth_rs::memory::{estimate_forward_memory, CheckpointConfig};

fn main() {
    let checkpoint = CheckpointConfig {
        enabled: true,
        checkpoint_every: 2,
    };
    
    let mem_bytes = estimate_forward_memory(
        4,     // batch_size
        2048,  // seq_len
        4096,  // hidden_size
        32,    // num_layers
        &checkpoint,
    );
    
    println!("Estimated memory: {:.2} GB", mem_bytes as f64 / 1e9);
}
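As a sanity check on such estimates, a back-of-envelope activation count is batch × seq_len × hidden × bytes-per-element per layer, with checkpointing retaining only roughly every k-th layer's activations. A rough sketch (assumed formula for illustration; the crate's estimate_forward_memory may account for more terms):

```rust
/// Back-of-envelope activation memory in bytes (fp32). With gradient
/// checkpointing every `k` layers, only ~layers/k activation sets are kept
/// and the rest are recomputed in the backward pass.
/// Assumed formula; not necessarily what the crate computes.
fn rough_activation_bytes(
    batch: usize,
    seq: usize,
    hidden: usize,
    layers: usize,
    checkpoint_every: Option<usize>,
) -> usize {
    let per_layer = batch * seq * hidden * 4; // 4 bytes per f32
    let kept = match checkpoint_every {
        Some(k) if k > 0 => (layers + k - 1) / k, // ceil(layers / k)
        _ => layers,
    };
    per_layer * kept
}

fn main() {
    // Same shape as the example above: 4 x 2048 x 4096, 32 layers.
    let bytes = rough_activation_bytes(4, 2048, 4096, 32, Some(2));
    println!("~{:.2} GB", bytes as f64 / 1e9);
}
```

Checkpointing every 2 layers halves this rough figure at the cost of recomputing the skipped activations.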

Benchmarks

Run benchmarks with:

cargo bench

Benchmarks test CPU performance across various configurations. GPU benchmarks require the cuda feature.

Development Roadmap

For detailed development plans and task breakdowns, see:

  • ROADMAP.md - Strategic development plan with phases and timelines
  • TASKS.md - Actionable task list with priorities and estimates
  • SUMMARY.md - Project review summary and execution guide

Contributing

Contributions are welcome, particularly:

  • GPU kernel implementations using CubeCL
  • Performance optimizations
  • Additional transformer operations

See TASKS.md for specific tasks that need implementation.

License

Licensed under the MIT License. See LICENSE for details.
