ringkernel-cuda

Crates.ioringkernel-cuda
lib.rsringkernel-cuda
version0.4.0
created_at2025-12-03 15:55:46.315364+00
updated_at2026-01-25 21:24:07.659532+00
descriptionCUDA backend for RingKernel - NVIDIA GPU support via cudarc
homepagehttps://github.com/mivertowski/RustCompute
repositoryhttps://github.com/mivertowski/RustCompute
max_upload_size
id1964391
size524,020
Michael Ivertowski (mivertowski)

documentation

README

ringkernel-cuda

NVIDIA CUDA backend for RingKernel.

Overview

This crate provides GPU compute support for RingKernel using NVIDIA CUDA via the cudarc library (v0.18.2). It implements the RingKernelRuntime trait for launching and managing persistent GPU kernels.

Requirements

  • NVIDIA GPU with Compute Capability 7.0 or higher (Volta, Turing, Ampere, Ada, Hopper)
  • CUDA Toolkit 12.x or later
  • cudarc 0.18.2 (managed via workspace)
  • Linux (native) or Windows (WSL2 with limitations)

Features

  • Persistent kernel execution using cooperative groups
  • Lock-free message queues in GPU global memory
  • PTX compilation at runtime via NVRTC
  • Multi-GPU device enumeration
  • Stencil kernel loading and execution

Usage

use ringkernel_cuda::CudaRuntime;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Check availability first
    if !ringkernel_cuda::is_cuda_available() {
        eprintln!("No CUDA device found");
        return Ok(());
    }

    let runtime = CudaRuntime::new().await?;
    let kernel = runtime.launch("processor", Default::default()).await?;

    // Process messages...

    kernel.terminate().await?;
    runtime.shutdown().await?;
    Ok(())
}

Stencil Kernel Loading

For pre-transpiled CUDA kernels:

use ringkernel_cuda::{StencilKernelLoader, LaunchConfig};

let loader = StencilKernelLoader::new(&cuda_device);
let kernel = loader.load_from_source(cuda_source)?;

let config = LaunchConfig {
    grid: (grid_x, grid_y, 1),
    block: (16, 16, 1),
    shared_mem: 0,
};

kernel.launch(&config, &[&input_buf, &output_buf])?;

cudarc 0.18.2 API

This crate uses cudarc 0.18.2 with the builder pattern for kernel launches:

use cudarc::driver::{CudaModule, CudaFunction, PushKernelArg};

// Load module and function
let module = device.inner().load_module(ptx)?;
let func = module.load_function("kernel_name")?;

// Launch with builder pattern
unsafe {
    stream
        .launch_builder(&func)
        .arg(&input_ptr)
        .arg(&output_ptr)
        .launch(cfg)?;
}

For cooperative kernel launches (grid-wide synchronization):

use cudarc::driver::result as cuda_result;

unsafe {
    cuda_result::launch_cooperative_kernel(
        func, grid_dim, block_dim, shared_mem_bytes, stream, kernel_params
    )?;
}

Exports

Type Description
CudaRuntime Main runtime implementing RingKernelRuntime
CudaDevice GPU device handle
CudaKernel Compiled kernel handle
CudaBuffer GPU memory buffer
CudaControlBlock GPU-resident kernel state
CudaMessageQueue Lock-free queue in GPU memory
StencilKernelLoader Loads CUDA stencil kernels

Platform Notes

Native Linux: Full support for persistent kernels using CUDA cooperative groups.

WSL2: Persistent kernels may not work due to cooperative group limitations. Falls back to event-driven execution.

Windows Native: Not currently supported. Use WSL2.

Testing

# Requires NVIDIA GPU
cargo test -p ringkernel-cuda --features cuda

License

Apache-2.0

Commit count: 170

cargo fmt