cubecl-core

Crates.io	cubecl-core
lib.rs	cubecl-core
version	0.9.0
created_at	2024-07-19 14:24:53.374967+00
updated_at	2026-01-15 14:05:07.851963+00
description	CubeCL core create
homepage
repository	https://github.com/tracel-ai/cubecl/tree/main/crates/cubecl-core
max_upload_size
id	1308732
size	825,713

Nathaniel Simard (nathanielsimard)

documentation

README

Multi-platform high-performance compute language extension for Rust.

TL;DR

With CubeCL, you can program your GPU using Rust, taking advantage of zero-cost abstractions to develop maintainable, flexible, and efficient compute kernels. CubeCL also comes with optimized runtimes managing memory management and lazy execution for any platform.

Supported Platforms

Platform	Runtime	Compiler	Hardware
WebGPU	wgpu	WGSL	Most GPUs
CUDA	CUDA	C++ (CUDA)	NVIDIA GPUs
ROCm	HIP	C++ (HIP)	AMD GPUs
Metal	wgpu	C++ (Metal)	Apple GPUs
Vulkan	wgpu	SPIR-V	Most GPUs on Linux & Windows
CPU	cpu	Rust	All Cpus, SIMD with most CPUs

Not all platforms support the same features. For instance Tensor Cores acceleration isn't supported on WebGPU yet. Using an instruction that isn't available on a platform will result with a compilation error at runtime. The launch function is normally responsible to dispatch the right kernel based on device properties.

Example

Simply annotate functions with the cube attribute to indicate that they should run on the GPU.

use cubecl::prelude::*;

#[cube(launch_unchecked)]
/// A [Line] represents a contiguous series of elements where SIMD operations may be available.
/// The runtime will automatically use SIMD instructions when possible for improved performance.
fn gelu_array<F: Float>(input: &Array<Line<F>>, output: &mut Array<Line<F>>) {
    if ABSOLUTE_POS < input.len() {
        output[ABSOLUTE_POS] = gelu_scalar(input[ABSOLUTE_POS]);
    }
}

#[cube]
fn gelu_scalar<F: Float>(x: Line<F>) -> Line<F> {
    // Execute the sqrt function at comptime.
    let sqrt2 = F::new(comptime!(2.0f32.sqrt()));
    let tmp = x / Line::new(sqrt2);

    x * (Line::erf(tmp) + 1.0) / 2.0
}

You can then launch the kernel using the autogenerated gelu_array::launch_unchecked function.

pub fn launch<R: Runtime>(device: &R::Device) {
    let client = R::client(device);
    let input = &[-1., 0., 1., 5.];
    let vectorization = 4;
    let output_handle = client.empty(input.len() * core::mem::size_of::<f32>());
    let input_handle = client.create(f32::as_bytes(input));

    unsafe {
        gelu_array::launch_unchecked::<f32, R>(
            &client,
            CubeCount::Static(1, 1, 1),
            CubeDim::new_1d(input.len() as u32 / vectorization),
            ArrayArg::from_raw_parts::<f32>(&input_handle, input.len(), vectorization as u8),
            ArrayArg::from_raw_parts::<f32>(&output_handle, input.len(), vectorization as u8),
        )
    };

    let bytes = client.read_one(output_handle);
    let output = f32::from_bytes(&bytes);

    // Should be [-0.1587,  0.0000,  0.8413,  5.0000]
    println!("Executed gelu with runtime {:?} => {output:?}", R::name(&client));
}

To see it in action, run the working GELU example with the following command:

cargo run --example gelu --features cpu  # cpu/simd runtime
cargo run --example gelu --features cuda # cuda runtime
cargo run --example gelu --features wgpu # wgpu runtime

Motivation

The goal of CubeCL is to ease the pain of writing highly optimized compute kernels that are portable across hardware. There is currently no adequate solution when you want optimal performance while still being multi-platform. You either have to write custom kernels for different hardware, often with different languages such as CUDA, Metal, or ROCm. To fix this, we created a Just-in-Time compiler with three core features: automatic vectorization, comptime, and autotune!

These features are extremely useful for anyone writing high-performance kernels, even when portability is not a concern. They improve code composability, reusability, testability, and maintainability, all while staying optimal. CubeCL also ships with a memory management strategy optimized for throughput with heavy buffer reuse to avoid allocations.

Our goal extends beyond providing an optimized compute language; we aim to develop an ecosystem of high-performance and scientific computing in Rust. To achieve this, we're developing linear algebra components that you can integrate into your own kernels. We currently have an highly optimized matrix multiplication module, leveraging Tensor Cores on NVIDIA hardware where available, while gracefully falling back to basic instructions on other platforms. While there's room for improvement, particularly in using custom instructions from newer NVIDIA GPUs, our implementation already delivers impressive performance.

We are a small team also building Burn, so don't hesitate to contribute and port algorithms; it can help more than you would imagine!

How it works

CubeCL leverages Rust's proc macro system in a unique two-step process:

Parsing: The proc macro parses the GPU kernel code using the syn crate.
Expansion: Instead of immediately generating an Intermediate Representation (IR), the macro generates a new Rust function.

The generated function, semantically similar to the original, is responsible for creating the IR when called. This approach differs from traditional compilers, which typically generate IR directly after parsing. Our method enables several key features:

Comptime: By not transforming the original code, it becomes remarkably easy to integrate compile-time optimizations.
Automatic Vectorization: By simply vectorizing the inputs of a CubeCL function, we can determine the vectorization factor of each intermediate variable during the expansion.
Rust Integration: The generated code remains valid Rust code, allowing it to be bundled without any dependency on the specific runtime.

Design

CubeCL is designed around - you guessed it - Cubes! More specifically, it's based on cuboids, because not all axes are the same size. Since all compute APIs need to map to the hardware, which are tiles that can be accessed using a 3D representation, our topology can easily be mapped to concepts from other APIs.

CubeCL - Topology

A cube is composed of units, so a 3x3x3 cube has 27 units that can be accessed by their positions along the x, y, and z axes. Similarly, a hyper-cube is composed of cubes, just as a cube is composed of units. Each cube in the hyper-cube can be accessed by its position relative to the hyper-cube along the x, y, and z axes. Hence, a hyper-cube of 3x3x3 will have 27 cubes. In this example, the total number of working units would be 27 x 27 = 729.

Topology Equivalence 👇

Since all topology variables are constant within the kernel entry point, we chose to use the Rust constant syntax with capital letters. Often when creating kernels, we don't always care about the relative position of a unit within a cube along each axis, but often we only care about its position in general. Therefore, each kind of variable also has its own axis-independent variable, which is often not present in other languages.

CubeCL	CUDA	WebGPU	Metal
CUBE_COUNT	N/A	N/A	N/A
CUBE_COUNT_X	gridDim.x	num_workgroups.x	threadgroups_per_grid.x
CUBE_COUNT_Y	gridDim.y	num_workgroups.y	threadgroups_per_grid.y
CUBE_COUNT_Z	gridDim.z	num_workgroups.z	threadgroups_per_grid.z
CUBE_POS	N/A	N/A	N/A
CUBE_POS_X	blockIdx.x	workgroup_id.x	threadgroup_position_in_grid.x
CUBE_POS_Y	blockIdx.y	workgroup_id.y	threadgroup_position_in_grid.y
CUBE_POS_Z	blockIdx.z	workgroup_id.z	threadgroup_position_in_grid.z
CUBE_DIM	N/A	N/A	N/A
CUBE_DIM_X	blockDim.x	workgroup_size.x	threads_per_threadgroup.x
CUBE_DIM_Y	blockDim.y	workgroup_size.y	threads_per_threadgroup.y
CUBE_DIM_Z	blockDim.z	workgroup_size.z	threads_per_threadgroup.z
UNIT_POS	N/A	local_invocation_index	thread_index_in_threadgroup
UNIT_POS_X	threadIdx.x	local_invocation_id.x	thread_position_in_threadgroup.x
UNIT_POS_Y	threadIdx.y	local_invocation_id.y	thread_position_in_threadgroup.y
UNIT_POS_Z	threadIdx.z	local_invocation_id.z	thread_position_in_threadgroup.z
PLANE_POS	N/A	subgroup_id	simdgroup_index_in_threadgroup
PLANE_DIM	warpSize	subgroup_size	threads_per_simdgroup
UNIT_POS_PLANE	N/A	subgroup_invocation_id	thread_index_in_simdgroup
ABSOLUTE_POS	N/A	N/A	N/A
ABSOLUTE_POS_X	N/A	global_id.x	thread_position_in_grid.x
ABSOLUTE_POS_Y	N/A	global_id.y	thread_position_in_grid.y
ABSOLUTE_POS_Z	N/A	global_id.z	thread_position_in_grid.z

Special Features

Automatic Vectorization

High-performance kernels should rely on SIMD instructions whenever possible, but doing so can quickly get pretty complicated! With CubeCL, you can specify the vectorization factor of each input variable when launching a kernel. Inside the kernel code, you still use only one type, which is dynamically vectorized and supports automatic broadcasting. The runtimes are able to compile kernels and have all the necessary information to use the best instruction! However, since the algorithmic behavior may depend on the vectorization factor, CubeCL allows you to access it directly in the kernel when needed, without any performance loss, using the comptime system!

Comptime

CubeCL isn't just a new compute language: though it feels like you are writing GPU kernels, you are, in fact, writing compiler plugins that you can fully customize! Comptime is a way to modify the compiler IR at runtime when compiling a kernel for the first time.

This enables lots of optimizations and flexibility without having to write many separate variants of the same kernels to ensure maximal performance.

Feature	Description
Instruction Specialization	Not all instructions are available on all hardware, but when a specialized one exists, it should be enabled with a simple if statement.
Automatic Vectorization	When you can use SIMD instructions, you should! But since not all hardware supports the same vectorization factors, it can be injected at runtime!
Loop Unrolling	You may want multiple flavors of the same kernel, with loop unrolling for only a certain range of values. This can be configured easily with Comptime.
Shape Specialization	For deep learning kernels, it's often crucial to rely on different kernels for different input sizes; you can do it by passing the shape information as Comptime values.
Compile Time Calculation	In general, you can calculate a constant using Rust runtime properties and inject it into a kernel during its compilation, to avoid recalculating it during each execution.

Autotuning

Autotuning drastically simplifies kernel selection by running small benchmarks at runtime to figure out the best kernels with the best configurations to run on the current hardware; an essential feature for portability. This feature combines gracefully with comptime to test the effect of different comptime values on performance; sometimes it can be surprising!

Even if the benchmarks may add some overhead when running the application for the first time, the information gets cached on the device and will be reused. It is usually a no-brainer trade-off for throughput-oriented programs such as deep learning models. You can even ship the autotune cache with your program, reducing cold start time when you have more control over the deployment target.

Resource

For now we don't have a lot of resources to learn, but you can look at the matrix multiplication library to see how CubeCL can be used. If you have any questions or want to contribute, don't hesitate to join the Discord.

Disclaimer & History

CubeCL is currently in alpha.

While CubeCL is used in Burn, there are still a lot of rough edges; it isn't refined yet. The project started as a WebGPU-only backend for Burn. As we optimized it, we realized that we needed an intermediate representation (IR) that could be optimized then compiled to WGSL. Having an IR made it easy to support another compilation target, so we made a CUDA runtime. However, writing kernels directly in that IR wasn't easy, so we created a Rust frontend using the syn crate. Navigating the differences between CUDA and WebGPU, while leveraging both platforms, forced us to come up with general concepts that worked everywhere. Hence, CubeCL was born!

Commit count: 862