| Crates.io | trueno-gpu |
|---|---|
| lib.rs | trueno-gpu |
| version | 0.4.11 |
| created_at | 2025-12-10 22:04:16.758873+00 |
| updated_at | 2026-01-25 14:21:14.556901+00 |
| description | Pure Rust PTX generation for NVIDIA CUDA - no LLVM, no nvcc |
| homepage | |
| repository | https://github.com/paiml/trueno |
| max_upload_size | |
| id | 1978856 |
| size | 3,910,203 |
Pure Rust PTX generation for NVIDIA CUDA - no LLVM, no nvcc, no external dependencies.
Own the Stack: build everything from first principles for complete control, auditability, and reproducibility.
use trueno_gpu::ptx::{PtxModule, PtxKernel, PtxType};
// Configure the PTX module header (PTX version, target architecture, address size)
let module = PtxModule::new()
    .version(8, 0)
    .target("sm_70")
    .address_size(64);
let ptx_source = module.emit();
assert!(ptx_source.contains(".version 8.0"));
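To make clear what the builder chain above produces, here is a minimal, self-contained sketch of a builder that emits the three standard PTX header directives (`.version`, `.target`, `.address_size`). `MiniModule` and its methods are hypothetical stand-ins written for illustration; they mirror the call shape shown above but are not trueno-gpu's implementation, and the exact text the crate emits may differ.

```rust
/// Hypothetical stand-in for a PTX module builder (NOT trueno-gpu's code).
#[derive(Debug, Clone)]
struct MiniModule {
    version: (u32, u32),
    target: String,
    address_size: u32,
}

impl MiniModule {
    /// Defaults are an assumption chosen for this sketch.
    fn new() -> Self {
        Self {
            version: (8, 0),
            target: "sm_70".to_string(),
            address_size: 64,
        }
    }

    /// Sets the PTX ISA version, e.g. (8, 0) for `.version 8.0`.
    fn version(mut self, major: u32, minor: u32) -> Self {
        self.version = (major, minor);
        self
    }

    /// Sets the target architecture, e.g. "sm_70".
    fn target(mut self, target: &str) -> Self {
        self.target = target.to_string();
        self
    }

    /// Sets the address size in bits (32 or 64 in PTX).
    fn address_size(mut self, bits: u32) -> Self {
        self.address_size = bits;
        self
    }

    /// Renders only the module header; a real generator would append kernels.
    fn emit(&self) -> String {
        format!(
            ".version {}.{}\n.target {}\n.address_size {}\n",
            self.version.0, self.version.1, self.target, self.address_size
        )
    }
}

fn main() {
    let ptx = MiniModule::new()
        .version(8, 0)
        .target("sm_70")
        .address_size(64)
        .emit();
    assert!(ptx.contains(".version 8.0"));
    print!("{ptx}");
}
```

Each setter takes `self` by value and returns it, which is what allows the calls to be chained without intermediate bindings.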
| Kernel | Description |
|---|---|
| GEMM | Matrix multiplication (naive, tiled, tensor core) |
| GEMV | Matrix-vector multiply with warp shuffle reduction |
| Softmax | Numerically stable softmax with warp shuffle |
| LayerNorm | Fused layer normalization |
| Attention | FlashAttention-style tiled attention |
| BiasActivation | Fused bias + activation epilogue (None/ReLU/GELU) |
| Quantize | Q4_K/Q5_K/Q6_K dequantization fused with matmul |
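As a point of reference for the Softmax row, "numerically stable" usually means subtracting the row maximum before exponentiating so that `exp` cannot overflow. The CPU sketch below shows only that technique; the function name `softmax_stable` is an assumption for illustration, and the crate's GPU kernel (which the table says uses warp shuffles for its reductions) is not reproduced here.

```rust
/// CPU reference for numerically stable softmax (illustrative, not the GPU kernel).
fn softmax_stable(x: &[f32]) -> Vec<f32> {
    if x.is_empty() {
        return Vec::new();
    }
    // Reduction 1: the maximum. On a GPU this is where a warp-level reduction would go.
    let max = x.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    // Shift by the maximum so the largest exponent is exp(0) = 1.
    let exps: Vec<f32> = x.iter().map(|&v| (v - max).exp()).collect();
    // Reduction 2: the sum of the shifted exponentials.
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|&e| e / sum).collect()
}

fn main() {
    // A naive exp(1000.0) overflows to infinity in f32; the shifted form does not.
    let probs = softmax_stable(&[1000.0, 1000.0]);
    println!("{probs:?}");
}
```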
use trueno_gpu::kernels::{GemmKernel, Kernel};
// Create a tiled GEMM kernel
let kernel = GemmKernel::tiled(1024, 1024, 1024);
let ptx = kernel.emit_ptx();
// The PTX can be loaded through the CUDA driver API
println!("{}", ptx);
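For readers unfamiliar with what "tiled" refers to, the following CPU sketch contrasts a naive matrix multiply with a blocked one that walks the output in `tile` x `tile` sub-blocks. It illustrates the loop structure only: `gemm_naive`, `gemm_tiled`, and the `tile` parameter are hypothetical names, matrices are assumed row-major, and the crate's generated PTX (shared memory staging, thread mapping) is not shown.

```rust
/// Naive row-major GEMM: C (m x n) = A (m x k) * B (k x n).
fn gemm_naive(a: &[f32], b: &[f32], m: usize, n: usize, k: usize) -> Vec<f32> {
    let mut c = vec![0.0f32; m * n];
    for i in 0..m {
        for j in 0..n {
            let mut acc = 0.0f32;
            for p in 0..k {
                acc += a[i * k + p] * b[p * n + j];
            }
            c[i * n + j] = acc;
        }
    }
    c
}

/// Blocked GEMM: same result, but work is grouped into tiles so each block of
/// A and B is reused while it is "hot" (cache on a CPU, shared memory on a GPU).
fn gemm_tiled(a: &[f32], b: &[f32], m: usize, n: usize, k: usize, tile: usize) -> Vec<f32> {
    assert!(tile > 0, "tile size must be positive");
    let mut c = vec![0.0f32; m * n];
    for i0 in (0..m).step_by(tile) {
        for j0 in (0..n).step_by(tile) {
            for p0 in (0..k).step_by(tile) {
                // Accumulate one tile's partial products into C.
                for i in i0..(i0 + tile).min(m) {
                    for j in j0..(j0 + tile).min(n) {
                        let mut acc = c[i * n + j];
                        for p in p0..(p0 + tile).min(k) {
                            acc += a[i * k + p] * b[p * n + j];
                        }
                        c[i * n + j] = acc;
                    }
                }
            }
        }
    }
    c
}

fn main() {
    let a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]; // 2 x 3
    let b = [7.0, 8.0, 9.0, 10.0, 11.0, 12.0]; // 3 x 2
    let naive = gemm_naive(&a, &b, 2, 2, 3);
    let tiled = gemm_tiled(&a, &b, 2, 2, 3, 2);
    assert_eq!(naive, tiled);
    println!("{tiled:?}");
}
```

The `min` bounds handle dimensions that are not a multiple of the tile size, so edge tiles are simply smaller.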
# PTX quickstart - basic vector addition
cargo run -p trueno-gpu --example ptx_quickstart
# GEMM kernel variants (naive, tiled, tensor core)
cargo run -p trueno-gpu --example gemm_kernel
# Bias + Activation epilogue kernel (ReLU, GELU)
cargo run -p trueno-gpu --example bias_activation
# Quantized GEMM (Q5_K, Q6_K formats)
cargo run -p trueno-gpu --example q5k_q6k_gemm
# FlashAttention (requires CUDA)
cargo run -p trueno-gpu --example flash_attention_cuda --features cuda
# Register allocation visualization
cargo run -p trueno-gpu --example register_allocation
- ptx - PTX code generation (builder pattern)
- kernels - Hand-optimized GPU kernels
- driver - CUDA driver API (minimal FFI, optional)
- memory - GPU memory management
- backend - Multi-backend abstraction

MIT License - see LICENSE for details.
This crate is part of the Trueno high-performance compute library.