| Crates.io | simd-kernels |
| lib.rs | simd-kernels |
| version | 0.2.0 |
| created_at | 2025-08-28 03:31:32.802601+00 |
| updated_at | 2025-08-29 23:12:39.406921+00 |
| description | Lightning Fast, Arrow-Compatible Compute Kernels |
| homepage | |
| repository | |
| max_upload_size | |
| id | 1813525 |
| size | 1,673,158 |
Welcome to SIMD-Kernels.
SIMD-Kernels is a modern library of compute kernels built on top of std::simd for high-performance analytics and scientific computing in Rust.
It implements the core arithmetic, statistical, logical, and scientific operations required for data systems —
accelerated with SIMD, aligned for cache efficiency, and compatible with the Apache Arrow model.
The kernels it implements form the computational core of the analytics stack, and it integrates cleanly with the minarrow columnar runtime.
SIMD-kernels focuses on ergonomics, throughput, and correctness: it is designed to maximise hardware utilisation whilst remaining easy to use across modern data workloads.

Key characteristics:

- Built on `std::simd`, with auto-vectorising fallback for scalar lanes (see the sketch below).
- Compiles to native SIMD widths with `RUSTFLAGS="-C target-cpu=native"`.
- 64-byte aligned buffers via `Vec64`, with alignment checks for SIMD correctness.
- Numeric coverage across `i8`–`u64` and `f32`/`f64`, with overflow handling.
- Scientific kernels: `erf`, gamma, FFT, matrix/vector ops.
- Sub 2-second compile times with standard features.
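For orientation, here is a minimal sketch of the `std::simd` pattern described above (a vectorised main loop plus a scalar tail). It is illustrative only and is not the crate's internal code; the lane count of 8 is an arbitrary choice for the example.

```rust
// Illustrative only: the std::simd "vector loop + scalar tail" pattern,
// not simd-kernels' internal implementation. Requires a nightly toolchain.
#![feature(portable_simd)]
use std::simd::prelude::*;

const LANES: usize = 8; // arbitrary lane count for illustration

fn add_f64(a: &[f64], b: &[f64], out: &mut [f64]) {
    assert!(a.len() == b.len() && a.len() == out.len());
    let full = a.len() / LANES * LANES;
    // Vectorised main loop over full SIMD chunks.
    for i in (0..full).step_by(LANES) {
        let va = Simd::<f64, LANES>::from_slice(&a[i..i + LANES]);
        let vb = Simd::<f64, LANES>::from_slice(&b[i..i + LANES]);
        (va + vb).copy_to_slice(&mut out[i..i + LANES]);
    }
    // Scalar fallback for the remaining lanes.
    for i in full..a.len() {
        out[i] = a[i] + b[i];
    }
}
```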
Modular by design. Enable only what you need:
- `linear_algebra` – BLAS/LAPACK via system libraries
- `probability_distributions` – PDFs, CDFs, quantiles
- `fourier_transforms` – FFT operations
- `universal_functions` – Scalar maths: exp, ln, sin, etc.

```toml
[features]
default = []
linear_algebra = ["blas-src"]
probability_distributions = []
fourier_transforms = []
```
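To enable a subset in a downstream project, turn on the relevant features on the dependency. The entry below is a hypothetical example; the version pin and feature selection are illustrative only.

```toml
# Hypothetical consumer Cargo.toml entry; feature names as listed above.
[dependencies]
simd-kernels = { version = "0.2", features = ["probability_distributions", "fourier_transforms"] }
```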
simd-kernels is divided into tightly scoped submodules:
```text
simd-kernels/
├── kernels/
│   ├── arithmetic/        # SIMD + null-safe arithmetic
│   ├── aggregate/         # Sum, mean, variance, etc.
│   ├── comparison/        # SIMD comparisons
│   ├── logical/           # Boolean logic (AND, OR, XOR)
│   ├── conditional/       # if-then-else kernels
│   ├── string/            # String processing
│   ├── window/            # Sliding window kernels
│   ├── binary/            # Bitwise ops
│   ├── sort/              # Parallel SIMD sort kernels
│   └── scientific/        # Special functions + FFT + matrix
│       ├── distributions/ # PDFs, CDFs, quantiles
│       ├── erf/           # Error functions
│       ├── fft/           # FFT pipelines
│       ├── matrix/        # Dense matrix kernels
│       ├── vector/        # SIMD vector ops
│       └── blas_lapack/   # External LAPACK bindings
├── traits/                # Kernel traits + marker traits
├── config/                # Compile-time feature flags
├── errors/                # KernelError definitions
└── utils/                 # Internal helpers (alignment, dispatch)
```
```rust
use simd_kernels::kernels::arithmetic::add_f64_dense;
use minarrow::{FloatArray, arr_f64};

let a = arr_f64![1.0, 2.0, 3.0];
let b = arr_f64![10.0, 20.0, 30.0];

let result = add_f64_dense(&a, &b).unwrap();
assert_eq!(result.values(), &[11.0, 22.0, 33.0]);
```
All arithmetic kernels use SIMD internally and support both dense and null-masked variants. This is particularly effective for fused multiply-add (FMA) kernels.
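As an illustration of why FMA maps well onto SIMD lanes, here is a hedged sketch in `std::simd` terms. The function `fma8` is purely illustrative and is not a simd-kernels API.

```rust
// Illustrative std::simd sketch of a fused multiply-add step (out = a * b + c);
// not the crate's actual FMA kernel.
#![feature(portable_simd)]
use std::simd::{prelude::*, StdFloat};

fn fma8(a: [f64; 8], b: [f64; 8], c: [f64; 8]) -> [f64; 8] {
    let (va, vb, vc) = (Simd::from_array(a), Simd::from_array(b), Simd::from_array(c));
    // mul_add can lower to a single hardware FMA instruction where the target supports it.
    va.mul_add(vb, vc).to_array()
}
```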
SIMD-kernels supports Apache-Arrow compatible null-masks via Minarrow.
Null-mask handling can also be skipped entirely: the dense variants take no mask, and passing a mask with `null_count = 0` lets the masked variants skip it similarly. This is useful in micro-batching contexts.
If you use Minarrow, you get this out of the box, with zero-copy semantics and very low-overhead types.
Note: this crate is not affiliated with Apache Arrow, however it implements Arrow-compatible null-semantics, and builds on Minarrow, which implements a focused subset of the Apache Arrow specification.
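The `null_count = 0` fast path mentioned above amounts to branching to the dense kernel whenever the mask carries no nulls. A hedged sketch of that idea follows; the `&[bool]` mask and the `sum_with_mask` function are stand-ins for illustration, not the crate's actual bitmap type or API.

```rust
// Hedged sketch of the "null_count == 0" fast path; `&[bool]` stands in for a
// real Arrow-style validity bitmap, and `sum_with_mask` is not a crate API.
fn sum_with_mask(values: &[f64], mask: Option<&[bool]>, null_count: usize) -> f64 {
    match mask {
        // No mask, or a mask known to contain no nulls: take the dense SIMD path.
        None => values.iter().sum(),
        Some(_) if null_count == 0 => values.iter().sum(),
        // Otherwise accumulate only the valid elements.
        Some(m) => values
            .iter()
            .zip(m)
            .filter(|&(_, &valid)| valid)
            .map(|(x, _)| x)
            .sum(),
    }
}
```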
| Operation | SIMD Support |
|---|---|
| `+ - * / %` | ✅ All numeric types |
| `< <= == != >= >` | ✅ SIMD mask comparisons |
| `is_nan`, `is_null` | ✅ SIMD + bitmap logic |
| `exp`, `ln`, `log10` | ✅ SIMD ufuncs |
| `normal_pdf` | ⚠️ Yes, but only where it makes sense. 21 univariate families (60+ functions!) are implemented and tested against SciPy. Roughly half of these are SIMD accelerated. |
| `fft8_radix` | ✅ DIT radix-8 via SIMD complex ops |
| `matmul`, `dot`, `axpy` | ✅ (optional via `linear_algebra`) |
| `if_else` | ✅ SIMD-lane conditional (see the sketch below) |
| `sum`, `mean`, `stdev` | ✅ SIMD + null-aware |
| `regex_match` | ✅ via `regex` crate |
| `sort` | ✅ SIMD radix sort |
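The comparison and `if_else` rows boil down to the mask-compare and lane-select operations that `std::simd` exposes directly. The sketch below is illustrative of that pattern only, not the crate's implementation.

```rust
// Illustrative std::simd sketch of mask comparison + lane selection;
// not simd-kernels' implementation.
#![feature(portable_simd)]
use std::simd::prelude::*;

/// Clamp every lane below `threshold` up to `threshold`.
fn clamp_below(values: [f64; 8], threshold: f64) -> [f64; 8] {
    let v = Simd::from_array(values);
    let t = Simd::splat(threshold);
    let below = v.simd_lt(t);       // SIMD mask comparison
    below.select(t, v).to_array()   // SIMD-lane conditional (if_else)
}
```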
Most statistical and scientific functions achieve relative error < 1e-15 for f64 compared to SciPy on standard domains.
Typical accuracy in the integration test suite is:
- < 1e-15 for core functions (e.g., `normal_pdf`, gamma, `erf`)
- < 1e-14 for distributions across mean ranges
- < 1e-12 in certain heavy-tail or extreme domains
- < 1e-10 in certain boundary cases, where SciPy itself becomes numerically unstable
Each implementation is tested against reference outputs from SciPy, hardcoded from a valid baseline x86_64 platform. These values are embedded in the test suite, so you can run full accuracy tests on your own architecture (and expect minor floating point tolerances).
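By way of illustration, such a check compares kernel output against a hard-coded reference within a relative tolerance. The kernel call below is a placeholder (substitute the crate's actual function and module path); the reference value 1/√(2π) for the standard normal PDF at x = 0 is exact mathematics, not a value taken from the test suite.

```rust
// Hedged sketch of the style of tolerance check described above.
fn relative_error(actual: f64, expected: f64) -> f64 {
    ((actual - expected) / expected).abs()
}

#[test]
fn normal_pdf_at_zero_matches_reference() {
    let expected = 0.3989422804014327; // 1 / sqrt(2 * pi), i.e. SciPy's norm.pdf(0.0)
    let actual = 0.3989422804014327;   // placeholder for the kernel's actual output
    assert!(relative_error(actual, expected) < 1e-15);
}
```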
That said, no accuracy guarantees are made. This library is new, and while SciPy has benefited from over a decade of numerical tuning and user feedback, simd-kernels is still maturing. If you require strict numerical guarantees, you must perform your own validations on all critical paths.
We make no guarantees regarding numerical accuracy. If you rely on this library in critical contexts, you must perform your own validation. Use is at your own risk.
Data buffers are backed by Minarrow's `Vec64`, which is the standard `Vec` with a custom 64-byte allocator (demonstrated in that repo's benchmarks to be practically as fast as the standard `Vec`).

Make sure SIMD compiles correctly with:

```bash
RUSTFLAGS="-C target-cpu=native" cargo +nightly build --features simd
```
By default, simd-kernels uses conservative architecture-specific lane widths inferred from CARGO_CFG_TARGET_FEATURE. However, you may override these lane counts at build time to experiment or test alternate configurations.
Set the environment variable SIMD_LANES_OVERRIDE before compiling:
# Format: "W8,W16,W32,W64"
# For example, simulate AVX-512:
SIMD_LANES_OVERRIDE="64,32,16,8" \
RUSTFLAGS="-C target-cpu=native" \
cargo +nightly build --features simd
This will override the automatically detected SIMD widths with:
- `W8 = 64` (e.g. `u8`, `i8`)
- `W16 = 32` (e.g. `u16`, `i16`)
- `W32 = 16` (e.g. `f32`, `i32`)
- `W64 = 8` (e.g. `f64`, `i64`)

An easy way to check which SIMD extensions your CPU actually supports is `lscpu | grep -i flags` in bash: the output will include flags such as `avx`, `avx2`, or `avx512f`. Here is how to interpret those flags so you can set the lane counts accordingly:
| Feature | Register Width | Lane Count (f64) | Lane Count (f32) | Lane Count (i16) |
|---|---|---|---|---|
| SSE2 | 128 bits | 2 | 4 | 8 |
| AVX | 256 bits | 4 | 8 | 16 |
| AVX-512 | 512 bits | 8 | 16 | 32 |
| NEON | 128 bits | 2 | 4 | 8 |
| WASM SIMD128 | 128 bits | 2 | 4 | 8 |
For example, `W64` should be set to 2 if you are on an SSE2-only consumer machine.
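Putting that together, an SSE2-class (128-bit) override derived from the table above would look like this, following the documented `"W8,W16,W32,W64"` format:

```bash
# 128-bit registers: W8 = 16, W16 = 8, W32 = 4, W64 = 2
SIMD_LANES_OVERRIDE="16,8,4,2" \
RUSTFLAGS="-C target-cpu=native" \
cargo +nightly build --features simd
```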
Obviously, the more lanes, the more parallel your computations will be within a single thread. This library deliberately excludes thread-level parallelism, because threading strategy is use-case specific and carries millisecond-level orchestration overhead. Combining this library with Rayon adds thread-level parallelism on top of SIMD, and Minarrow supports this pattern natively (see the sketch below).
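A hedged sketch of that Rayon pairing: split the data into chunks, run a slice-level kernel per chunk on the thread pool, then combine. The per-chunk `chunk.iter().sum()` is a stand-in for whichever simd-kernels aggregate you would actually call.

```rust
// Hedged sketch: thread-level parallelism via Rayon around slice-level kernels.
// The per-chunk body is a stand-in for a simd-kernels aggregate call.
use rayon::prelude::*;

fn parallel_sum(values: &[f64]) -> f64 {
    values
        .par_chunks(64 * 1024) // one task per 64K-element chunk; tune per workload
        .map(|chunk| chunk.iter().sum::<f64>())
        .sum()
}
```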
| Use Case | Description |
|---|---|
| Extreme low-latency computation | Very low abstraction overhead, direct kernels |
| Engine Kernel Layer | High-throughput compute for execution engines |
| Statistical Pipelines | SIMD evaluation of distributions + aggregates |
| Signal Processing | FFTs, filters, and transforms |
| Vectorised Scientific Computing | Accurate special functions |
| Columnar DBMS | Null-aware SIMD kernels for query pipelines |
| Embedded Systems | Compile-time feature gating for footprint control |
- **Flexible** – Every kernel is statically typed and callable with no dynamic dispatch. Numerical kernels are all slice-compatible, so they support diverse entry contexts.
- **Fast** – SIMD lanes are used wherever possible, with scalar fallback only when required. 64-byte alignment is guaranteed when using Minarrow's `Vec64`, `IntegerArray` or `FloatArray` types.
- **Composable** – Minimal dependencies, fast builds, clean layering.
- **Feature-rich** – Proper mask propagation and bitmap handling, even for univariate distributions; or opt out completely for standard float NaN semantics.
- **Compatible** – When used via Minarrow, you get FFI-compatible buffers, `.to_apache_arrow()` and `.to_polars()`.
We welcome contributions. See CONTRIBUTING.md for guidance.
Coming soon.
Licensed under the Mozilla Public License (MPL) 2.0
This license is in place to strike a balance between open ecosystem contribution and the needs of developers and enterprises.
If you have commercial requirements not covered by this license, please reach out directly.
Please open an issue or reach out with ideas, requests, or contributions.