| Crates.io | trueno |
| lib.rs | trueno |
| version | 0.14.2 |
| created_at | 2025-11-17 11:16:17.850307+00 |
| updated_at | 2026-01-25 15:02:22.795803+00 |
| description | High-performance SIMD compute library with GPU support for matrix operations |
| homepage | |
| repository | https://github.com/paiml/trueno |
| max_upload_size | |
| id | 1936618 |
| size | 6,905,186 |
trueno (Spanish: "thunder") provides unified compute primitives across CPU SIMD, GPU, and WebAssembly.
GPU support runs on wgpu, and the companion crate trueno-gpu generates CUDA PTX in pure Rust (no nvcc required).

```toml
[dependencies]
trueno = "0.14"

# Optional: GPU support for large matrices
trueno = { version = "0.14", features = ["gpu"] }

# Optional: Pure Rust CUDA PTX generation
trueno-gpu = "0.4"
```
```rust
use trueno::{Vector, Matrix, SymmetricEigen};

// Vector operations - auto-selects best SIMD backend
let a = Vector::from_slice(&[1.0, 2.0, 3.0, 4.0]);
let b = Vector::from_slice(&[5.0, 6.0, 7.0, 8.0]);
let sum = a.add(&b).unwrap();      // [6.0, 8.0, 10.0, 12.0]
let dot = a.dot(&b).unwrap();      // 70.0
let activated = a.relu().unwrap(); // ReLU activation

// Matrix operations
let m = Matrix::from_vec(2, 2, vec![1.0, 2.0, 3.0, 4.0]).unwrap();
let product = m.matmul(&m).unwrap(); // Matrix multiplication
let transposed = m.transpose();      // Transpose

// Batched matmul for transformers (Q @ K^T pattern)
// Q is [batch, heads, seq, dim], K^T is [batch, heads, dim, seq],
// and the result attn is [batch, heads, seq, seq].
let (batch, heads, seq, dim) = (2, 4, 8, 64);
let q: Vec<f32> = vec![0.1; batch * heads * seq * dim];
let kt: Vec<f32> = vec![0.1; batch * heads * dim * seq];
let attn = Matrix::batched_matmul_4d(&q, &kt, batch, heads, seq, dim, seq).unwrap();

// Eigendecomposition (PCA, spectral analysis)
let cov = Matrix::from_vec(2, 2, vec![3.0, 1.0, 1.0, 3.0]).unwrap();
let eigen = SymmetricEigen::new(&cov).unwrap();
let eigenvalues = eigen.eigenvalues(); // [4.0, 2.0]
```
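The eigenvalues in that last comment can be checked by hand from the characteristic polynomial of the covariance matrix:

$$\det\begin{pmatrix}3-\lambda & 1\\ 1 & 3-\lambda\end{pmatrix} = (3-\lambda)^2 - 1 = 0 \quad\Rightarrow\quad \lambda_1 = 4,\ \lambda_2 = 2$$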
| Operation | SIMD Speedup | Notes |
|---|---|---|
| Dot product | 6-17x | AVX-512 for compute-bound |
| Matrix multiply | 2-10x | GPU for 500×500 and larger |
| Reductions (sum, max, min) | 3-12x | AVX-512 optimal |
| Element-wise (add, mul) | 1-2x | Memory-bound |
| Convolution 2D | 5-8x | AVX2/AVX-512 optimized |
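The element-wise rows sit at 1-2x because they are memory-bound: an f32 add moves 12 bytes (two loads, one store) per single FLOP, an arithmetic intensity far below what the SIMD units can saturate:

$$I_{\text{add}} = \frac{1\ \text{FLOP}}{12\ \text{bytes}} \approx 0.083\ \text{FLOP/byte}$$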
| Benchmark | Result |
|---|---|
| Vector recip (AVX-512, 10K elements) | 10.0 Gelem/s |
| Vector recip (AVX2, 10K elements) | 9.7 Gelem/s |
| PTX module emit | 3.1 µs |
| PTX kernel build | 81 ns |
| Launch config | 1.7 ns |
GPU Note: GPU acceleration currently benefits matrix multiplication only; element-wise operations stay on CPU SIMD, since the host-device transfer overhead exceeds the compute time.
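A rough back-of-envelope model shows why: for an n×n matmul, transfer cost grows as O(n²) while compute grows as O(n³), so the GPU only wins once n is large enough. A minimal sketch of that trade-off (not part of the trueno API; every constant is an illustrative assumption, not a measured number):

```rust
/// Break-even estimate for offloading an n x n f32 matmul to the GPU.
/// All constants are illustrative assumptions, not measured trueno numbers:
/// transfer cost scales with n^2, compute cost with n^3.
fn gpu_wins(n: usize) -> bool {
    const PCIE_BYTES_PER_SEC: f64 = 16e9;   // assumed host<->device bandwidth
    const LAUNCH_OVERHEAD_SEC: f64 = 50e-6; // assumed launch + sync cost
    const GPU_FLOPS: f64 = 10e12;           // assumed GPU throughput
    const CPU_FLOPS: f64 = 200e9;           // assumed CPU SIMD throughput

    let n = n as f64;
    let bytes = 3.0 * n * n * 4.0; // copy A and B in, C back out
    let flops = 2.0 * n * n * n;   // multiply-adds in a dense matmul

    let gpu = LAUNCH_OVERHEAD_SEC + bytes / PCIE_BYTES_PER_SEC + flops / GPU_FLOPS;
    let cpu = flops / CPU_FLOPS;
    gpu < cpu
}
```

With these particular constants the crossover falls in the few-hundreds range per side, consistent with the 500×500+ guidance in the table above.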
Generate CUDA PTX kernels without nvcc, LLVM, or external toolchains:
```rust
use trueno_gpu::kernels::{GemmKernel, Kernel, SoftmaxKernel};

// Generate optimized GEMM kernel
let gemm = GemmKernel::tensor_core(1024, 1024, 1024);
let ptx = gemm.emit_ptx(); // Pure Rust PTX generation

// Generate softmax with warp shuffle reduction
let softmax = SoftmaxKernel::new(4096);
let ptx = softmax.emit_ptx();

// Available kernels: GEMM, Softmax, LayerNorm, Attention, Quantize (Q4K/Q5K/Q6K)
```
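The emitted PTX is plain text, so one minimal way to consume it is to write it to disk for inspection or later loading by a CUDA driver binding; a small sketch, assuming emit_ptx() returns a String as the snippet above implies:

```rust
use trueno_gpu::kernels::{GemmKernel, Kernel};

fn main() -> std::io::Result<()> {
    // Same kernel-builder API as shown above.
    let gemm = GemmKernel::tensor_core(1024, 1024, 1024);
    let ptx = gemm.emit_ptx();

    // Persist for offline inspection (e.g. with ptxas) or runtime loading.
    std::fs::write("gemm_1024.ptx", &ptx)?;
    println!("emitted {} bytes of PTX", ptx.len());
    Ok(())
}
```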
- **Vector**: add, sub, mul, div, dot, sum, min, max, argmin, argmax, norm_l1, norm_l2, normalize, recip, sqrt, abs, clamp
- **Activations**: relu, leaky_relu, elu, sigmoid, tanh, gelu, swish, softmax, log_softmax, silu
- **Matrix**: matmul, batched_matmul, batched_matmul_4d, transpose, matvec, convolve2d, pooling (max/avg), topk, gather, pad
- **Statistics**: mean, variance, stddev, covariance, correlation, zscore (illustrated in the sketch after this list)
- **Eigen**: symmetric eigendecomposition (Jacobi algorithm)
- **GPU Kernels**: GEMM (naive/tiled/tensor core), Softmax, LayerNorm, RMSNorm, Attention, GEMV, Quantization
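As a quick illustration of the statistics group, the sketch below is hypothetical usage: it assumes mean, stddev, and zscore are Vector methods following the same Result-returning convention as add and dot above (check the API docs for exact signatures).

```rust
use trueno::Vector;

// Hypothetical usage: assumes mean/stddev/zscore are Result-returning
// Vector methods, matching the add/dot convention shown earlier.
let v = Vector::from_slice(&[2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]);
let mu = v.mean().unwrap();   // 5.0
let sd = v.stddev().unwrap(); // 2.0 if stddev is the population form
let z = v.zscore().unwrap();  // per-element (x - mean) / stddev
```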
```sh
cargo test                            # Run tests
cargo bench                           # Run benchmarks
make coverage                         # Coverage report (requires cargo-llvm-cov)
cargo run --example backend_detection # Check available backends
```
Part of the Pragmatic AI Labs stack.
MIT - see LICENSE