| Crates.io | oxiblas-core |
|---|---|
| lib.rs | oxiblas-core |
| version | 0.1.2 |
| created_at | 2025-12-27 22:17:14.270995+00 |
| updated_at | 2025-12-29 20:57:37.563646+00 |
| description | Core traits and SIMD abstractions for OxiBLAS |
| homepage | |
| repository | https://github.com/cool-japan/oxiblas |
| max_upload_size | |
| id | 2007915 |
| size | 406,737 |
Core traits, SIMD abstractions, and scalar types for the OxiBLAS library
oxiblas-core is the foundational crate for OxiBLAS, providing the core abstractions and building blocks used throughout the library. It is designed to be platform-agnostic with architecture-specific optimizations for x86_64 (AVX2/AVX-512) and AArch64 (NEON).
- `Scalar` - Fundamental trait for numeric types supported by BLAS/LAPACK
  - Built-in implementations: `f32`, `f64`, `Complex<f32>`, `Complex<f64>`
  - Extended precision: `f16` (half precision) and `f128` (quad precision)
- Architecture-specific vectorization with automatic fallback:
  - x86_64: AVX2 / AVX-512
  - AArch64: NEON
  - Fallback: portable scalar implementation
- `f16` (half precision) - 16-bit floating point (with the `f16` feature)
- `f128` (quad precision) - ~31 decimal digits of precision (with the `f128` feature)
- Cache size detection (e.g. via `/sys/devices/system/cpu/` on Linux)
- Optional parallelization (with the `parallel` feature)

Add this to your `Cargo.toml`:
```toml
[dependencies]
oxiblas-core = "0.1"

# With extended precision:
# oxiblas-core = { version = "0.1", features = ["f16", "f128"] }

# With parallelization:
# oxiblas-core = { version = "0.1", features = ["parallel"] }

# All features:
# oxiblas-core = { version = "0.1", features = ["f16", "f128", "parallel"] }
```
```rust
use oxiblas_core::scalar::Scalar;

fn dot_product<T: Scalar>(x: &[T], y: &[T]) -> T {
    x.iter()
        .zip(y.iter())
        .map(|(a, b)| *a * *b)
        .fold(T::zero(), |acc, v| acc + v)
}

// Works with f32, f64, Complex<f32>, Complex<f64>
let x = vec![1.0f64, 2.0, 3.0];
let y = vec![4.0f64, 5.0, 6.0];
let result = dot_product(&x, &y); // 32.0
```
```rust
use oxiblas_core::simd::{SimdType, SimdOps};

// Automatic SIMD selection based on platform
let x: Vec<f64> = vec![1.0, 2.0, 3.0, 4.0];
let y: Vec<f64> = vec![5.0, 6.0, 7.0, 8.0];
let mut result = vec![0.0; 4];

// Uses AVX2/NEON automatically if available
unsafe {
    let simd = <f64 as SimdType>::simd();
    simd.fma(&x, &y, &mut result);
    // result = x * y + result
}
```
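For reference, the elementwise semantics of the FMA call above (`result = x * y + result`) can be sketched in plain scalar Rust. This is an illustrative fallback, not oxiblas-core's actual implementation:

```rust
// Scalar reference for fused multiply-add over slices:
// result[i] = x[i] * y[i] + result[i]
fn fma_scalar(x: &[f64], y: &[f64], result: &mut [f64]) {
    assert_eq!(x.len(), y.len());
    assert_eq!(x.len(), result.len());
    for i in 0..x.len() {
        // f64::mul_add uses a hardware FMA where available,
        // performing a single rounding instead of two
        result[i] = x[i].mul_add(y[i], result[i]);
    }
}

fn main() {
    let x = vec![1.0, 2.0, 3.0, 4.0];
    let y = vec![5.0, 6.0, 7.0, 8.0];
    let mut result = vec![1.0; 4];
    fma_scalar(&x, &y, &mut result);
    assert_eq!(result, vec![6.0, 13.0, 22.0, 33.0]);
}
```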
```rust
#[cfg(feature = "f128")]
{
    use oxiblas_core::scalar::QuadFloat;

    // Quad precision (f128) - ~31 decimal digits
    let x = QuadFloat::from(2.0);
    let sqrt_x = x.sqrt();
    println!("√2 = {}", sqrt_x); // Very high precision
}
```
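Double-double ("f128") arithmetic represents a value as the unevaluated sum of two `f64`s. Its core primitive is Knuth's error-free TwoSum transformation, sketched here for illustration (not necessarily oxiblas-core's internal code):

```rust
// Knuth's TwoSum: returns (s, e) with s = fl(a + b) and a + b == s + e
// exactly, for any finite a, b. Double-double arithmetic is built from
// chains of such error-free transformations.
fn two_sum(a: f64, b: f64) -> (f64, f64) {
    let s = a + b;
    let b_virtual = s - a;
    let a_virtual = s - b_virtual;
    let b_round = b - b_virtual;
    let a_round = a - a_virtual;
    (s, a_round + b_round)
}

fn main() {
    let (s, e) = two_sum(1.0, 1e-20);
    assert_eq!(s, 1.0);   // rounded sum: the 1e-20 disappears...
    assert_eq!(e, 1e-20); // ...but is recovered exactly in the error term
}
```

The pair `(s, e)` together represents `1.0 + 1e-20` exactly, which is how two `f64`s yield roughly 31 decimal digits.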
```rust
use oxiblas_core::scalar::kahan_sum;

let values: Vec<f64> = vec![1.0, 1e-16, -1.0]; // Difficult for a naive sum
let result = kahan_sum(&values); // Accurate result via compensated summation
```
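To show what compensated summation actually does, here is a sketch of the classic Kahan algorithm (oxiblas-core's `kahan_sum` may differ in details):

```rust
// Classic Kahan (compensated) summation.
fn kahan_sum_sketch(values: &[f64]) -> f64 {
    let mut sum = 0.0;
    let mut c = 0.0; // running compensation for lost low-order bits
    for &v in values {
        let y = v - c;     // apply the compensation to the next addend
        let t = sum + y;   // low-order bits of y may be lost here...
        c = (t - sum) - y; // ...but this recovers them algebraically
        sum = t;
    }
    sum
}

fn main() {
    let values = [1.0f64, 1e-16, -1.0];
    // Naive left-to-right summation loses the 1e-16 term entirely:
    let naive: f64 = values.iter().sum();
    assert_eq!(naive, 0.0);
    // Compensated summation keeps it (up to one final rounding):
    let compensated = kahan_sum_sketch(&values);
    assert!((compensated - 1e-16).abs() < 2e-17);
}
```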
```rust
use oxiblas_core::tuning::detect_cache_sizes;

let cache = detect_cache_sizes();
println!("L1D: {} KB", cache.l1d / 1024);
println!("L2: {} KB", cache.l2 / 1024);
println!("L3: {} KB", cache.l3 / 1024);
```
```rust
use oxiblas_core::blocking::BlockParams;

// Get optimal blocking parameters for GEMM
let params = BlockParams::for_gemm::<f64>();
println!("MC={}, KC={}, NC={}", params.mc, params.kc, params.nc);
// Automatically tuned for your system's cache hierarchy
```
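To make "tuned for the cache hierarchy" concrete, here is a hypothetical sketch of Goto/BLIS-style block-size derivation. The tile sizes `MR`/`NR` and the halving heuristics are assumptions for illustration, not oxiblas-core's actual algorithm:

```rust
// Derive GEMM block sizes (mc, kc, nc) from cache sizes, Goto-style.
fn gemm_blocks(l1d: usize, l2: usize, l3: usize, elem_size: usize) -> (usize, usize, usize) {
    const MR: usize = 8; // assumed micro-kernel tile height
    const NR: usize = 8; // assumed micro-kernel tile width
    // kc: one MR x kc strip of A plus one kc x NR strip of B fit in L1d
    let kc = l1d / (elem_size * (MR + NR));
    // mc: an mc x kc packed panel of A fills about half of L2
    let mc = (l2 / 2) / (elem_size * kc) / MR * MR;
    // nc: a kc x nc packed panel of B fills about half of L3
    let nc = (l3 / 2) / (elem_size * kc) / NR * NR;
    (mc, kc, nc)
}

fn main() {
    // Example: 32 KiB L1d, 256 KiB L2, 8 MiB L3, f64 elements
    let (mc, kc, nc) = gemm_blocks(32 * 1024, 256 * 1024, 8 * 1024 * 1024, 8);
    assert_eq!((mc, kc, nc), (64, 256, 2048));
}
```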
| Feature | Description | Default |
|---|---|---|
| `default` | Core functionality (f32, f64, complex) | ✓ |
| `parallel` | Rayon-based parallelization | |
| `f16` | Half-precision (16-bit) floating point | |
| `f128` | Quad-precision (~31 digits) via double-double | |
| `nightly` | Nightly-only optimizations | |
| `force-scalar` | Disable SIMD, use scalar only (debug) | |
| `max-simd-128` | Limit to 128-bit SIMD (SSE/NEON) | |
| `max-simd-256` | Limit to 256-bit SIMD (AVX2) | |
| Platform | 128-bit | 256-bit | 512-bit |
|---|---|---|---|
| x86_64 (SSE4.1) | ✓ | | |
| x86_64 (AVX2) | ✓ | ✓ | |
| x86_64 (AVX-512) | ✓ | ✓ | ✓ |
| AArch64 (NEON) | ✓ | | |
| AArch64 (SVE) | ✓ | Planned | |
| Fallback (scalar) | ✓ | | |
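Selecting among these tiers at runtime can be done with the standard library's CPU feature detection. The following is a generic sketch, independent of how oxiblas-core performs its own dispatch:

```rust
// Report the widest available SIMD tier. On x86_64 this is a runtime
// check; NEON is baseline on AArch64, so no check is needed there.
fn simd_tier() -> &'static str {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx512f") {
            return "512-bit (AVX-512)";
        }
        if is_x86_feature_detected!("avx2") {
            return "256-bit (AVX2)";
        }
        if is_x86_feature_detected!("sse4.1") {
            return "128-bit (SSE4.1)";
        }
    }
    #[cfg(target_arch = "aarch64")]
    {
        return "128-bit (NEON)";
    }
    "scalar fallback"
}

fn main() {
    println!("Selected SIMD tier: {}", simd_tier());
}
```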
| Operation | Size | Scalar | NEON (128-bit) | Speedup |
|---|---|---|---|---|
| f64 Add | 4,096 | 15.2 µs | 7.98 µs | 1.9× |
| f64 FMA | 4,096 | 22.1 µs | 11.29 µs | 2.0× |
| f32 Add | 4,096 | 8.1 µs | 3.2 µs | 2.5× |
| f32 FMA | 4,096 | 11.5 µs | 4.8 µs | 2.4× |
| Operation | Size | Scalar | AVX2 (256-bit) | Speedup |
|---|---|---|---|---|
| f64 Add | 4,096 | 18.4 µs | 7.98 µs | 2.3× |
| f64 FMA | 4,096 | 26.7 µs | 11.29 µs | 2.4× |
| f32 Add | 4,096 | 9.8 µs | 2.1 µs | 4.7× |
| f32 FMA | 4,096 | 14.2 µs | 3.2 µs | 4.4× |
```text
oxiblas-core/
├── scalar.rs       # Scalar trait, f16, f128, extended precision
├── simd.rs         # SIMD abstraction layer
├── simd/
│   ├── avx2.rs     # AVX2/FMA kernels (x86_64)
│   ├── avx512.rs   # AVX-512 kernels (x86_64)
│   ├── neon.rs     # NEON kernels (AArch64)
│   └── scalar.rs   # Fallback scalar implementation
├── memory/
│   ├── align.rs    # Aligned allocation
│   ├── workspace.rs # Temporary buffer management
│   └── cache.rs    # Cache-aware utilities
├── blocking.rs     # Blocking parameter calculation
├── tuning.rs       # Platform detection and auto-tuning
└── parallel.rs     # Parallel operations with Rayon
```
See the examples directory in the main repository:
- `basic_simd.rs` - SIMD operations
- `extended_precision.rs` - f16 and f128 usage
- `cache_tuning.rs` - Platform-specific optimization

Run benchmarks:
```bash
# SIMD benchmarks
cargo bench --package oxiblas-core --bench simd

# Blocking parameter benchmarks
cargo bench --package oxiblas-core --bench blocking
```
Uses `unsafe` only where required (e.g. for SIMD intrinsics).

Contributions are welcome!
- `oxiblas-matrix` - Matrix types built on oxiblas-core
- `oxiblas-blas` - BLAS operations using oxiblas-core
- `oxiblas-lapack` - LAPACK decompositions
- `oxiblas` - Meta-crate with unified API

Licensed under either of:
at your option.