wgpu-algorithms

Crates.io	wgpu-algorithms
lib.rs	wgpu-algorithms
version	0.1.0
created_at	2026-01-10 18:32:11.266625+00
updated_at	2026-01-10 18:32:11.266625+00
description	High-performance, safe wgpu Radix Sort and Prefix Scan for Rust.
repository	https://github.com/samjsui/wgpu-algorithms
id	2034523
size	102,778

Sam Sui (SamJSui)

wgpu-algorithms

A high-performance, safe WebGPU sorting and scanning library for Rust.

Safe Rust API. All memory management, bind groups, and synchronization are handled internally. No unsafe blocks in library code.

Performance Benchmarks

Benchmarks were run on an Apple M3 Max (Metal backend).

CPU: Rayon Parallel Sort (par_sort_unstable)
GPU Resident: Sorts data already in VRAM (Pipeline use-case)
GPU Round-Trip: Upload -> Sort -> Download (Utility use-case)

Items	CPU (Rayon)	GPU (Resident)	GPU (Round-Trip)	Verdict
100k	0.52 ms	6.0 ms	7.2 ms	❌ CPU Wins (Driver Overhead)
1M	4.5 ms	9.1 ms	10.1 ms	❌ CPU Wins
10M	44.1 ms	31.3 ms	40.9 ms	✅ GPU Wins (1.4x)
100M	506 ms	273 ms	407 ms	🚀 GPU Domination (1.85x)

Throughput (100M items):

Scan: ~5.2 Billion items/sec
Sort: ~365 Million items/sec
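
The speedup and throughput figures can be reproduced from the table with a few lines of arithmetic. The sketch below is illustrative only: the helper names are made up for this example, and it assumes the verdict column and the sort throughput are both derived from the GPU (Resident) timings, which matches the published numbers.

```rust
/// Speedup of the GPU over the CPU for the same input
/// (values above 1.0 mean the GPU is faster).
fn speedup(cpu_ms: f64, gpu_ms: f64) -> f64 {
    cpu_ms / gpu_ms
}

/// Throughput in items per second for `items` processed in `time_ms` milliseconds.
fn throughput_per_sec(items: f64, time_ms: f64) -> f64 {
    items / (time_ms / 1000.0)
}

fn main() {
    // Timings copied from the 10M and 100M rows (CPU vs. GPU Resident).
    println!("10M speedup:  {:.2}x", speedup(44.1, 31.3));
    println!("100M speedup: {:.2}x", speedup(506.0, 273.0));

    // Assumption: sort throughput uses the GPU-resident time at 100M items.
    let sort = throughput_per_sec(100e6, 273.0);
    println!("sort throughput: {:.0} M items/s", sort / 1e6);
}
```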

Prefix Scan (Inclusive Add)

Benchmarks include driver submission overhead (queue.submit + device.poll).

Items	Time	Throughput	Bandwidth (Effective)
100M	19.2 ms	5.2 Gelem/s	~41.6 GB/s

Note: Bandwidth is calculated as read + write traffic (4 bytes * 2 * items / time).
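
As a worked instance of that formula, the sketch below recomputes the effective bandwidth for the 100M-item row. The function name is hypothetical, and it assumes 1 GB = 10^9 bytes and 4-byte (u32) items.

```rust
/// Effective bandwidth in GB/s, counting one read and one write per item.
/// Assumes 1 GB = 1e9 bytes.
fn effective_bandwidth_gb_s(items: u64, bytes_per_item: u64, time_ms: f64) -> f64 {
    let bytes_moved = (2 * bytes_per_item * items) as f64; // read + write
    bytes_moved / (time_ms / 1000.0) / 1e9
}

fn main() {
    // 100M u32 items scanned in 19.2 ms, as in the table above.
    let gb_s = effective_bandwidth_gb_s(100_000_000, 4, 19.2);
    println!("effective bandwidth: {gb_s:.2} GB/s");
}
```

The result lands at roughly 41.7 GB/s, in line with the ~41.6 GB/s reported in the table.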

Architecture

The library implements parallel algorithms tailored for the WebGPU execution model:

LSD Radix Sort: Processes 2 bits per pass (4 bins, so 16 passes for a full 32-bit key), with separate "Counting" and "Scattering" kernels.
Hierarchical Scan: A "Reduce-Then-Scan" approach using 3 separate kernels (Upsweep, Scan-Aux, Downsweep) to handle arbitrary input sizes.
Vector Tiling (VT): Automatically adjusts items-per-thread based on GPU capability (e.g., VT=8 for Desktop, VT=4 for Mobile) to saturate memory bandwidth.
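
To make the three ideas above concrete, here is a minimal single-threaded CPU reference in plain Rust. It is a sketch of the general techniques, not the library's WGSL kernels: the function names, the workgroup size of 256, and the mapping of phases to kernels are all assumptions made for illustration. Each loop over tiles stands in for what a GPU would run as one workgroup per tile.

```rust
const RADIX_BITS: u32 = 2;
const BINS: usize = 1 << RADIX_BITS; // 4 bins per pass

/// Number of workgroups needed when each thread handles `vt` items.
/// `workgroup_size` is a hypothetical value, not taken from the library.
fn workgroup_count(n: usize, workgroup_size: usize, vt: usize) -> usize {
    let tile = workgroup_size * vt; // items covered by one workgroup
    (n + tile - 1) / tile
}

/// Inclusive add scan in three phases (reduce, scan of tile sums, per-tile scan).
fn hierarchical_inclusive_scan(data: &mut [u32], tile: usize) {
    assert!(tile > 0);
    // Phase 1 (reduce): one partial sum per tile.
    let mut aux: Vec<u32> = data
        .chunks(tile)
        .map(|c| c.iter().fold(0u32, |a, &x| a.wrapping_add(x)))
        .collect();
    // Phase 2 (scan-aux): exclusive scan of the per-tile sums.
    let mut running = 0u32;
    for s in aux.iter_mut() {
        let sum = *s;
        *s = running;
        running = running.wrapping_add(sum);
    }
    // Phase 3: scan each tile locally, seeded with its tile offset.
    for (chunk, &base) in data.chunks_mut(tile).zip(&aux) {
        let mut acc = base;
        for x in chunk.iter_mut() {
            acc = acc.wrapping_add(*x);
            *x = acc;
        }
    }
}

/// LSD radix sort over u32 keys, 2 bits per pass (16 passes in total).
fn radix_sort_2bit(keys: &mut Vec<u32>) {
    let mask = BINS as u32 - 1;
    let mut scratch = vec![0u32; keys.len()];
    for pass in 0..(32 / RADIX_BITS) {
        let shift = pass * RADIX_BITS;
        // "Counting" step: histogram of the current 2-bit digit.
        let mut counts = [0usize; BINS];
        for &k in keys.iter() {
            counts[((k >> shift) & mask) as usize] += 1;
        }
        // Exclusive scan of the histogram gives each bin's start offset.
        let mut offsets = [0usize; BINS];
        for b in 1..BINS {
            offsets[b] = offsets[b - 1] + counts[b - 1];
        }
        // "Scattering" step: stable placement into the scratch buffer.
        for &k in keys.iter() {
            let bin = ((k >> shift) & mask) as usize;
            scratch[offsets[bin]] = k;
            offsets[bin] += 1;
        }
        std::mem::swap(keys, &mut scratch);
    }
}

fn main() {
    let mut keys = vec![0xDEAD_BEEF, 7, 42, 0, u32::MAX, 7, 1 << 31, 3];
    radix_sort_2bit(&mut keys);
    assert!(keys.windows(2).all(|w| w[0] <= w[1]));

    let mut data = vec![1u32; 10];
    hierarchical_inclusive_scan(&mut data, 4); // 10 is not a multiple of 4
    assert_eq!(data, (1..=10).collect::<Vec<u32>>());

    println!("workgroups for 100M items, VT=8: {}", workgroup_count(100_000_000, 256, 8));
}
```

Keeping counting and scattering as separate steps means the histogram can be scanned in between, which is where a prefix scan fits into a GPU radix sort.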

Features

Adaptive Sorting: Automatically switches between CPU (latency-optimized) and GPU (throughput-optimized) based on input size (< 1M items uses CPU).
Zero-Allocation Hot Loop: Reuses internal workspace buffers and pre-baked BindGroups to minimize driver pressure during animation loops.
WGPU Safe: Runs on Metal, Vulkan, DX12, and WebGPU without experimental features.
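
The adaptive dispatch described above can be pictured as a simple size check. The following is a hedged sketch, not the library's code: `GPU_THRESHOLD`, `Backend`, `choose_backend`, and `adaptive_sort` are hypothetical names, and the GPU path is represented by a caller-supplied closure so the example runs without a GPU.

```rust
/// Hypothetical cutoff, taken from the "< 1M items uses CPU" rule above.
const GPU_THRESHOLD: usize = 1_000_000;

#[derive(Clone, Copy, Debug, PartialEq)]
enum Backend {
    Cpu,
    Gpu,
}

/// Small inputs go to the CPU (lower latency), large ones to the GPU (higher throughput).
fn choose_backend(len: usize) -> Backend {
    if len < GPU_THRESHOLD {
        Backend::Cpu
    } else {
        Backend::Gpu
    }
}

/// Sorts in place and reports which backend was used.
/// `gpu_sort` is a stand-in for a real GPU sort.
fn adaptive_sort(data: &mut [u32], gpu_sort: impl FnOnce(&mut [u32])) -> Backend {
    let backend = choose_backend(data.len());
    match backend {
        Backend::Cpu => data.sort_unstable(),
        Backend::Gpu => gpu_sort(data),
    }
    backend
}

fn main() {
    let mut small = vec![3u32, 1, 2];
    // The closure is not called here because the input is below the cutoff.
    let used = adaptive_sort(&mut small, |d| d.sort_unstable());
    assert_eq!(used, Backend::Cpu);
    assert_eq!(small, [1, 2, 3]);
    assert_eq!(choose_backend(10_000_000), Backend::Gpu);
}
```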

Usage

use wgpu_algorithms::{Context, Sorter};

#[tokio::main]
async fn main() {
    // 1. Initialize GPU Context
    let ctx = Context::init().await.unwrap();
    let mut sorter = Sorter::new(&ctx);

    // 2. Data
    let data: Vec<u32> = (0..50_000_000).map(|_| rand::random()).collect();

    // 3. Sort (Auto-selects CPU or GPU based on size)
    let sorted = sorter.sort(&data).await;

    // 4. Verification: every adjacent pair must be in non-decreasing order
    assert!(sorted.windows(2).all(|w| w[0] <= w[1]));
}

Installation

Add this to your Cargo.toml (the usage example above also requires the tokio and rand crates):

[dependencies]
wgpu-algorithms = "0.1.0"

Commit count: 1
