trueno-gpu

Crates.io: trueno-gpu
lib.rs: trueno-gpu
version: 0.4.11
created_at: 2025-12-10 22:04:16.758873+00
updated_at: 2026-01-25 14:21:14.556901+00
description: Pure Rust PTX generation for NVIDIA CUDA - no LLVM, no nvcc
repository: https://github.com/paiml/trueno
id: 1978856
size: 3,910,203
Noah Gift (noahgift)

trueno-gpu

Pure Rust PTX generation for NVIDIA CUDA - no LLVM, no nvcc, no external dependencies.

Philosophy

Own the Stack - Build everything from first principles for complete control, auditability, and reproducibility.

Features

  • Pure Rust PTX Generation: Generate PTX assembly directly from Rust code
  • No External Dependencies: No LLVM, nvcc, or CUDA toolkit required for code generation
  • Builder Pattern API: Ergonomic API for constructing PTX modules and kernels
  • Hand-Optimized Kernels: Pre-built kernels for common ML operations

Quick Start

use trueno_gpu::ptx::PtxModule;

// Configure a PTX module header: PTX ISA 8.0, sm_70 target, 64-bit addressing
let module = PtxModule::new()
    .version(8, 0)
    .target("sm_70")
    .address_size(64);

let ptx_source = module.emit();
assert!(ptx_source.contains(".version 8.0"));
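To make the builder pattern concrete, here is a minimal std-only sketch of how chained setters can produce the three PTX header directives checked above. `HeaderSketch` and its methods are hypothetical names invented for this illustration; they are not the trueno-gpu types, and the real `PtxModule` may be organized quite differently.

```rust
/// Hypothetical stand-in for a PTX module header (not the trueno-gpu type).
#[derive(Debug, Clone)]
struct HeaderSketch {
    version: (u32, u32),
    target: String,
    address_size: u32,
}

impl HeaderSketch {
    /// Start from assumed defaults; each setter below overrides one field.
    fn new() -> Self {
        Self {
            version: (8, 0),
            target: "sm_70".to_string(),
            address_size: 64,
        }
    }

    /// Set the PTX ISA version, e.g. (8, 0) for `.version 8.0`.
    fn version(mut self, major: u32, minor: u32) -> Self {
        self.version = (major, minor);
        self
    }

    /// Set the target architecture, e.g. "sm_70".
    fn target(mut self, target: &str) -> Self {
        self.target = target.to_string();
        self
    }

    /// Set the address size in bits (32 or 64 in PTX).
    fn address_size(mut self, bits: u32) -> Self {
        self.address_size = bits;
        self
    }

    /// Emit the header directives as PTX text, one directive per line.
    fn emit(&self) -> String {
        format!(
            ".version {}.{}\n.target {}\n.address_size {}\n",
            self.version.0, self.version.1, self.target, self.address_size
        )
    }
}
```

Each setter takes `self` by value and returns it, which is what allows the calls to be chained in a single expression, as in the Quick Start snippet.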

Available Kernels

Kernel          Description
GEMM            Matrix multiplication (naive, tiled, tensor core)
GEMV            Matrix-vector multiply with warp shuffle reduction
Softmax         Numerically stable softmax with warp shuffle
LayerNorm       Fused layer normalization
Attention       FlashAttention-style tiled attention
BiasActivation  Fused bias + activation epilogue (None/ReLU/GELU)
Quantize        Q4_K/Q5_K/Q6_K dequantization fused with matmul
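As an illustration of what "numerically stable" means for the Softmax kernel, the following CPU sketch subtracts the row maximum before exponentiating, which is the standard way to keep `exp` from overflowing. It is a plain Rust reference written for this README, not code from the crate; `softmax_reference` is a hypothetical name, and the GPU kernel presumably computes the maximum and the sum with warp shuffles rather than sequential loops.

```rust
/// CPU reference for a numerically stable softmax over one row.
/// Hypothetical helper for illustration; not part of trueno-gpu.
fn softmax_reference(input: &[f32]) -> Vec<f32> {
    if input.is_empty() {
        return Vec::new();
    }

    // Step 1: find the row maximum (a reduction on the GPU).
    let max = input.iter().copied().fold(f32::NEG_INFINITY, f32::max);

    // Step 2: exponentiate shifted values; every exponent is <= 0,
    // so exp() stays in (0, 1] and cannot overflow.
    let exps: Vec<f32> = input.iter().map(|&x| (x - max).exp()).collect();

    // Step 3: normalize by the sum (a second reduction on the GPU).
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|&e| e / sum).collect()
}
```

A reference like this can be useful for checking GPU output on inputs with large magnitudes, where a naive `exp(x) / sum(exp(x))` would produce infinities or NaN.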

Usage

use trueno_gpu::kernels::{GemmKernel, Kernel};

// Create a tiled GEMM kernel
let kernel = GemmKernel::tiled(1024, 1024, 1024);
let ptx = kernel.emit_ptx();

// The emitted PTX can be loaded through the CUDA driver API
println!("{}", ptx);
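When running a generated GEMM kernel on a GPU, it helps to have a CPU result to compare against. The sketch below is one possible reference, assuming row-major `f32` matrices with A of shape m x k, B of shape k x n, and C of shape m x n. The function name `gemm_reference` and the layout are assumptions made for this example; the README does not state the memory layout the crate's kernels expect.

```rust
/// Naive CPU matrix multiply: C = A * B.
/// Assumes row-major storage (an assumption, not confirmed by the crate):
/// A is m x k, B is k x n, and the returned C is m x n.
fn gemm_reference(m: usize, n: usize, k: usize, a: &[f32], b: &[f32]) -> Vec<f32> {
    assert_eq!(a.len(), m * k, "A must hold m * k elements");
    assert_eq!(b.len(), k * n, "B must hold k * n elements");

    let mut c = vec![0.0f32; m * n];
    for i in 0..m {
        for p in 0..k {
            // Hoist A[i][p] so the inner loop walks B and C contiguously.
            let a_ip = a[i * k + p];
            for j in 0..n {
                c[i * n + j] += a_ip * b[p * n + j];
            }
        }
    }
    c
}
```

Because floating-point summation order differs between this loop and a tiled or tensor core kernel, comparisons against GPU output should use a tolerance rather than exact equality.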

Examples

# PTX quickstart - basic vector addition
cargo run -p trueno-gpu --example ptx_quickstart

# GEMM kernel variants (naive, tiled, tensor core)
cargo run -p trueno-gpu --example gemm_kernel

# Bias + Activation epilogue kernel (ReLU, GELU)
cargo run -p trueno-gpu --example bias_activation

# Quantized GEMM (Q5_K, Q6_K formats)
cargo run -p trueno-gpu --example q5k_q6k_gemm

# FlashAttention (requires CUDA)
cargo run -p trueno-gpu --example flash_attention_cuda --features cuda

# Register allocation visualization
cargo run -p trueno-gpu --example register_allocation

Modules

  • ptx - PTX code generation (builder pattern)
  • kernels - Hand-optimized GPU kernels
  • driver - CUDA driver API (minimal FFI, optional)
  • memory - GPU memory management
  • backend - Multi-backend abstraction

Requirements

  • Rust 1.70+
  • For GPU execution: NVIDIA CUDA driver (optional, only needed to run generated PTX)

License

MIT License - see LICENSE for details.

Part of Trueno

This crate is part of the Trueno high-performance compute library.

Commit count: 830