simd-lookup

Crates.io: simd-lookup
lib.rs: simd-lookup
version: 0.1.0
created_at: 2025-12-18 23:34:38.370875+00
updated_at: 2025-12-18 23:34:38.370875+00
description: High-performance SIMD utilities for fast table lookups, compression and data processing
homepage: https://github.com/velvia/simd-lookup
repository: https://github.com/velvia/simd-lookup
id: 1993843
size: 314,480
author: Evan Chan (velvia)
documentation: https://docs.rs/simd-lookup

README

simd-lookup

High-performance SIMD utilities for fast table lookups, compression and data processing in Rust.

Features

  • Cross-platform SIMD: Automatic dispatch to optimal implementation (AVX-512, AVX2, NEON)
  • Zero-cost abstractions: Thin wrappers over platform intrinsics via the wide crate
  • Comprehensive utilities: Compress, shuffle, widen, split, and bitmask operations

CPU Feature Requirements

This crate automatically detects and uses the best available CPU features, with fallbacks for older CPUs. The crate is optimized for both ARM NEON (aarch64) and Intel AVX-512 (x86_64) architectures.

Note: Table64 is primarily optimized for ARM NEON using the TBL4 instruction, which provides excellent performance on Apple Silicon and other ARMv8+ CPUs. On Intel x86_64, it requires newer AVX-512 features (Ice Lake+).

Summary Table

Module/Feature | Required CPU Features | Available CPUs | Fallback
simd_compress (compress_store_u32x8) | AVX512F + AVX512VL (x86), NEON TBL2 (ARM) | Skylake-X+, Ice Lake+, All ARM | NEON TBL on ARM, Shuffle table elsewhere
simd_compress (compress_store_u32x16) | AVX512F | Skylake-X+, Ice Lake+ | Two u32x8 compresses
simd_compress (compress_store_u8x16) | AVX512VBMI2 + AVX512VL (x86), NEON TBL (ARM) | Ice Lake+, Tiger Lake+, All ARM | NEON TBL on ARM, gather-style writes elsewhere
simd_gather (gather_u32index_u8) | AVX512F + AVX512BW | Skylake-X+, Ice Lake+ | Scalar loop
simd_gather (gather_u32index_u32) | AVX512F | Skylake-X+, Ice Lake+ | Scalar loop
Table64 | ARM NEON TBL4 (aarch64) or AVX512BW + AVX512VBMI (x86_64) | All ARMv8+ (Apple Silicon), Ice Lake+ | Scalar lookup (x86_64 only)
Table2dU8xU8 | AVX512F + AVX512BW | Skylake-X+, Ice Lake+ | Scalar lookup
Cascading Lookup Kernel | AVX512F + AVX512VL + AVX512BW + AVX512VBMI2 | Ice Lake+, Tiger Lake+ | Scalar lookup

Detailed Requirements

SIMD Compress Kernels (simd_compress module)

  • compress_store_u32x8:

    • Intel x86_64: Requires AVX512F + AVX512VL, uses VPCOMPRESSD instruction
    • ARM aarch64: Uses NEON TBL2 with precomputed byte-level shuffle indices
      • Eliminates 8 conditional branches from scalar fallback
      • 256×32 byte lookup table for O(1) index computation
    • Available on: Intel Skylake-X+, All ARMv8+ (Apple Silicon M1/M2/M3)
    • Fallback: Shuffle-based table lookup (other architectures)
  • compress_store_u32x16: Requires AVX512F

    • Uses VPCOMPRESSD instruction (512-bit variant)
    • Available on: Intel Skylake-X (Xeon), Ice Lake, Tiger Lake, and later
    • Fallback: Two compress_store_u32x8 operations
  • compress_store_u8x16:

    • Intel x86_64: Requires AVX512VBMI2 + AVX512VL, uses VPCOMPRESSB instruction
    • ARM aarch64: Uses NEON TBL (vqtbl1q_u8) with precomputed shuffle indices
      • Eliminates 16 conditional branches from scalar fallback
      • 1MB lookup table (65536×16 bytes) for O(1) index computation
      • Single TBL instruction performs entire 16-byte shuffle
    • Available on: Intel Ice Lake+, All ARMv8+ (Apple Silicon M1/M2/M3)
    • Fallback: Gather-style direct writes (other architectures)
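
The shuffle-table technique behind these kernels can be illustrated with a scalar sketch (the names below are illustrative, not the crate's internals): for every 8-bit mask, precompute the lane indices of its set bits, so a compress becomes one table load plus one shuffle.

fn compress_indices(mask: u8) -> [usize; 8] {
    let mut idx = [7usize; 8]; // unused slots point at a valid lane
    let mut k = 0;
    for lane in 0..8 {
        if mask & (1 << lane) != 0 {
            idx[k] = lane;
            k += 1;
        }
    }
    idx
}

fn compress_u32x8_ref(data: [u32; 8], mask: u8, out: &mut [u32; 8]) -> usize {
    // In the real kernels this loop is a single TBL2 (NEON) or VPCOMPRESSD
    // (AVX-512), and compress_indices is a precomputed 256-entry table.
    let idx = compress_indices(mask);
    for (o, &i) in out.iter_mut().zip(idx.iter()) {
        *o = data[i];
    }
    mask.count_ones() as usize
}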

SIMD Gather Operations (simd_gather module)

  • gather_u32index_u8: Requires AVX512F + AVX512BW

    • Uses VGATHERDPS + VPMOVDB instructions
    • Available on: Intel Skylake-X (Xeon), Ice Lake, Tiger Lake, and later
    • Fallback: Scalar loop
  • gather_u32index_u32: Requires AVX512F

    • Uses VGATHERDPS instruction
    • Available on: Intel Skylake-X (Xeon), Ice Lake, Tiger Lake, and later
    • Fallback: Scalar loop
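
As a scalar reference for the gather semantics (the exact signatures and the 16-lane width below are assumptions for illustration, not the published API), each lane loads table[index]; the AVX-512 path does all loads with VGATHERDPS and narrows to bytes with VPMOVDB for the u8 variant.

fn gather_u32index_u32_ref(table: &[u32], indices: &[u32; 16]) -> [u32; 16] {
    let mut out = [0u32; 16];
    for (o, &i) in out.iter_mut().zip(indices) {
        *o = table[i as usize]; // VGATHERDPS performs all 16 loads at once
    }
    out
}

fn gather_u32index_u8_ref(table: &[u8], indices: &[u32; 16]) -> [u8; 16] {
    let mut out = [0u8; 16];
    for (o, &i) in out.iter_mut().zip(indices) {
        *o = table[i as usize]; // gathered as dwords, then narrowed to bytes (VPMOVDB)
    }
    out
}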

Small Table Lookups (small_table module)

  • Table64: Highly optimized for ARM NEON (primary optimization target)

    • ARM aarch64 (Apple Silicon, etc.): Uses ARM NEON TBL4 instruction (vqtbl4q_u8)
      • Native hardware support on all ARMv8+ CPUs (including Apple M1/M2/M3)
      • Extremely efficient single-instruction 64-byte table lookup
      • No fallback needed - full SIMD acceleration on ARM
    • Intel x86_64: Requires AVX512BW + AVX512VBMI
      • Uses VPERMB instruction (_mm512_permutexvar_epi8) for 64-byte table lookups
      • Available on: Intel Ice Lake, Tiger Lake, and later (not available on Skylake-X)
      • Fallback: Scalar lookup (works on all x86_64 CPUs)
  • Table2dU8xU8: Requires AVX512F + AVX512BW (via simd_gather)

    • Uses VGATHERDPS + VPMOVDB for parallel lookups
    • Available on: Intel Skylake-X (Xeon), Ice Lake, Tiger Lake, and later
    • Fallback: Scalar lookup
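
As a scalar reference for what a 64-entry byte-table lookup computes (the type and method names here are illustrative, not the crate's API, and clamping indices into 0..64 is an assumption made only for this sketch):

struct Table64Sketch {
    table: [u8; 64],
}

impl Table64Sketch {
    fn lookup_all(&self, input: &[u8], out: &mut Vec<u8>) {
        // NEON TBL4 (vqtbl4q_u8) or AVX-512 VPERMB replaces this per-byte loop
        // with a single 64-byte table permute per vector of indices.
        out.extend(input.iter().map(|&b| self.table[(b & 0x3F) as usize]));
    }
}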

Cascading Lookup Kernels (lookup_kernel module)

  • SimdCascadingTableU32U8Lookup: Requires AVX512F + AVX512VL + AVX512BW + AVX512VBMI2
    • Uses compress_store_u8x16, compress_store_u32x16, and gather_u32index_u8
    • Provides 40-50% speedup over scalar implementations on large tables
    • Available on: Intel Ice Lake, Tiger Lake, and later (not available on Skylake-X)
    • Fallback: Scalar lookup (works on all architectures)

CPU Generation Reference

  • Skylake-X (2017): AVX512F, AVX512VL, AVX512BW ✅ | AVX512VBMI ❌ | AVX512VBMI2 ❌
  • Ice Lake (2019): AVX512F, AVX512VL, AVX512BW, AVX512VBMI, AVX512VBMI2 ✅
  • Tiger Lake (2020): AVX512F, AVX512VL, AVX512BW, AVX512VBMI, AVX512VBMI2 ✅
  • Apple Silicon (M1/M2/M3): ARM NEON (TBL4) ✅ - no AVX-512 equivalent needed

Checking CPU Features

You can check which features your CPU supports:

# Linux
grep flags /proc/cpuinfo | head -1

# Or use Rust's feature detection
cargo run --example check_features

All functions automatically detect available CPU features at runtime and use the best available implementation.
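
For reference, the standard library's runtime detection macros (plain std, independent of this crate) can report the relevant features directly:

fn print_simd_features() {
    #[cfg(target_arch = "x86_64")]
    {
        println!("avx2:        {}", is_x86_feature_detected!("avx2"));
        println!("avx512f:     {}", is_x86_feature_detected!("avx512f"));
        println!("avx512vl:    {}", is_x86_feature_detected!("avx512vl"));
        println!("avx512bw:    {}", is_x86_feature_detected!("avx512bw"));
        println!("avx512vbmi:  {}", is_x86_feature_detected!("avx512vbmi"));
        println!("avx512vbmi2: {}", is_x86_feature_detected!("avx512vbmi2"));
    }
    #[cfg(target_arch = "aarch64")]
    {
        println!("neon: {}", std::arch::is_aarch64_feature_detected!("neon"));
    }
}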

SIMD Utilities (wide_utils module)

This crate provides a rich set of SIMD utilities built on top of the wide crate, with optimized implementations for x86_64 (AVX-512/AVX2) and aarch64 (NEON).

Compress Operations (simd_compress module)

Stream compaction similar to AVX-512's VCOMPRESS instruction — pack selected elements contiguously based on a bitmask.

🚀 Highly optimized for ARM NEON — achieves up to 12 Gelem/s on Apple Silicon!

use simd_lookup::{compress_store_u32x8, compress_store_u32x16, compress_store_u8x16};
use wide::{u32x8, u32x16, u8x16};

// Compress u32x8: select elements where mask bits are set
let data = u32x8::from([10, 20, 30, 40, 50, 60, 70, 80]);
let mask = 0b10110010u8; // Select positions 1, 4, 5, 7
let mut output = [0u32; 8];  // Must have room for full vector!

let count = compress_store_u32x8(data, mask, &mut output);
// count == 4, output[0..4] == [20, 50, 60, 80]

// Also available for u32x16 (512-bit) and u8x16

Note: Destination buffer must have room for the full uncompressed vector (8/16 elements). This enables fast direct NEON stores instead of variable-length copies.

Function | AVX-512 | ARM NEON | Throughput (ARM)
compress_store_u32x8 | VPCOMPRESSD | TBL2 + direct store | ~4.3 Gelem/s
compress_store_u32x16 | VPCOMPRESSD | 2× NEON u32x8 | ~5.3 Gelem/s
compress_store_u8x16 | VPCOMPRESSB | TBL + direct store | ~12 Gelem/s
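
A usage sketch for the buffer requirement, assuming the compress_store_u32x8 signature shown in the example above: compress through a full-width scratch buffer and append only the count valid elements.

use simd_lookup::compress_store_u32x8;
use wide::u32x8;

/// Keep only the even values from `values` (length assumed to be a multiple of 8).
fn keep_even(values: &[u32]) -> Vec<u32> {
    let mut out = Vec::with_capacity(values.len()); // worst case: everything is kept
    let mut scratch = [0u32; 8];                    // full-vector scratch, as the API requires
    for chunk in values.chunks_exact(8) {
        let v = u32x8::from(<[u32; 8]>::try_from(chunk).unwrap());
        // Bit i of the mask selects lane i; here we keep the even values.
        let mut mask = 0u8;
        for (i, &x) in chunk.iter().enumerate() {
            if x % 2 == 0 {
                mask |= 1 << i;
            }
        }
        let count = compress_store_u32x8(v, mask, &mut scratch);
        out.extend_from_slice(&scratch[..count as usize]);
    }
    out
}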

Shuffle/Permute Operations

Variable-index shuffle using the same SIMD type for indices (zero-copy from lookup tables):

use simd_lookup::WideUtilsExt;
use wide::u32x8;

let data = u32x8::from([10, 20, 30, 40, 50, 60, 70, 80]);
let indices = u32x8::from([7, 6, 5, 4, 3, 2, 1, 0]); // Reverse

let reversed = data.shuffle(indices);
// reversed == [80, 70, 60, 50, 40, 30, 20, 10]

Type | AVX2 | NEON | Scalar
u32x8 | VPERMD | TBL2 (byte-level) | Loop
u32x4 | — | TBL (byte-level) | Loop
u8x16 | PSHUFB | TBL | Loop

Vector Splitting (SimdSplit trait)

Efficiently extract high/low halves of wide vectors:

use simd_lookup::SimdSplit;
use wide::u32x16;

let data = u32x16::from([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]);
let (lo, hi) = data.split_low_high();
// lo: u32x8 = [1,2,3,4,5,6,7,8]
// hi: u32x8 = [9,10,11,12,13,14,15,16]

// Or extract just one half
let low_half = data.low_half();
let high_half = data.high_half();

Type | AVX-512 | Fallback
u32x16 → u32x8 | _mm512_extracti64x4_epi64 | Array slicing
u64x8 → u64x4 | _mm512_extracti64x4_epi64 | Array slicing

Widening Operations

Zero-extend smaller types to larger types:

use simd_lookup::WideUtilsExt;
use wide::{u32x8, u64x8};

let input = u32x8::from([1, 2, 3, 4, 5, 6, 7, 8]);
let widened: u64x8 = input.widen_to_u64x8();
// widened == [1u64, 2, 3, 4, 5, 6, 7, 8]

Type | AVX-512 | AVX2 | NEON
u32x8 → u64x8 | VPMOVZXDQ | VPMOVZXDQ | VMOVL
u32x4 → u64x4 | VPMOVZXDQ | — | VMOVL

Bitmask to Vector Conversion

Convert a scalar bitmask to a SIMD mask vector:

use simd_lookup::FromBitmask;
use wide::u64x8;

let mask = 0b10101010u8;
let mask_vec: u64x8 = u64x8::from_bitmask(mask);
// mask_vec == [0, MAX, 0, MAX, 0, MAX, 0, MAX]

Type | AVX-512 | ARM NEON | AVX2/Other
u64x8 | VPBROADCASTQ + mask | VCEQ + VMOVL chain | Loop
u32x8 | VPBROADCASTD + mask | VCEQ + VMOVL chain | Loop

Double (double() method on WideUtilsExt)

Efficiently double each element via self + self. Addition is well-supported on all architectures (NEON vaddq, SSE paddb), making repeated doubling an efficient way to multiply by powers of 2:

use simd_lookup::WideUtilsExt;
use wide::u8x16;

let a = u8x16::from([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]);

// x * 2
let doubled = a.double();
// doubled == [2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]

// x * 8 (chain three doubles)
let times_8 = a.double().double().double();
// times_8 == [8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128]

This is more efficient than scalar multiplication for types like u8x16 where x86 lacks native byte multiply/shift instructions.

Shuffle Index Tables

Pre-computed shuffle indices for compress operations (256 entries for 8-element masks):

use simd_lookup::{SHUFFLE_COMPRESS_IDX_U32X8, get_compress_indices_u32x8};

// Raw array access
let indices: [u32; 8] = SHUFFLE_COMPRESS_IDX_U32X8[0b10110010];
// indices == [1, 4, 5, 7, 7, 7, 7, 7] (unused positions filled with 7)

// Zero-cost SIMD access via transmute
let simd_indices = get_compress_indices_u32x8(0b10110010u8);

Other Modules

small_table — Small Table SIMD Lookup

64-entry lookup table primarily optimized for ARM NEON TBL4 (excellent performance on Apple Silicon) and also supports AVX-512 VPERMB on Intel Ice Lake+. Useful for fast pattern detection and small dictionary lookups.

prefetch — SIMD Memory Prefetch

Cross-platform memory prefetch utilities including masked prefetch for 8 addresses at once. Supports L1/L2/L3 cache hints.
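
For context only (this is not the crate's API), the kind of primitive such utilities wrap on x86 is the _mm_prefetch intrinsic, whose const hint selects the target cache level:

#[cfg(target_arch = "x86_64")]
#[inline]
fn prefetch_l1(ptr: *const u8) {
    use core::arch::x86_64::{_mm_prefetch, _MM_HINT_T0};
    // _MM_HINT_T0 targets L1; _MM_HINT_T1/_MM_HINT_T2 hint deeper cache levels.
    // SAFETY: prefetch hints never fault, even on invalid addresses.
    unsafe { _mm_prefetch::<{ _MM_HINT_T0 }>(ptr as *const i8) };
}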

bulk_vec_extender — Efficient Vec Extension

Utilities for efficiently extending Vec with SIMD-produced results, minimizing bounds checks and reallocations.

entropy_map_lookup — Entropy-Optimized Lookups

Lookup structures optimized for low-entropy (few unique values) data, using bitpacking and small lookup tables.

eight_value_lookup — 8-Value Fast Path

Specialized lookup for tables with ≤8 unique values, using SIMD comparison and bitmask extraction.

Performance Notes

ARM NEON Compress Performance (Apple Silicon M1/M2/M3)

The NEON compress operations achieve exceptional throughput through optimized direct vector stores:

Operation | Throughput | vs Scalar
compress_store_u8x16 | ~12 Gelem/s | ~8× faster
compress_store_u32x8 | ~4.3 Gelem/s | ~3-4× faster
compress_store_u32x16 | ~5.3 Gelem/s | ~5-6× faster

Key optimizations:

  • Direct NEON stores: Uses vst1q_u8 to write full vectors instead of variable-length copies
  • Single TBL instruction: compress_store_u8x16 uses one vqtbl1q_u8 for 16-byte shuffle
  • Precomputed byte indices: Lookup tables eliminate runtime index computation
  • No branches: Mask-dependent branching eliminated entirely
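
A minimal sketch of this fast path for the u8x16 case, with the shuffle table built at runtime here for illustration (the crate's actual table layout and internals may differ):

// Build the 65536×16 shuffle-index table: row m holds the positions of the set
// bits of mask m (unused slots stay 0; only the first popcount output bytes matter).
fn build_u8x16_shuffle_table() -> Vec<[u8; 16]> {
    (0u32..65536)
        .map(|m| {
            let mut row = [0u8; 16];
            let mut k = 0;
            for bit in 0..16 {
                if m & (1 << bit) != 0 {
                    row[k] = bit as u8;
                    k += 1;
                }
            }
            row
        })
        .collect()
}

// One TBL shuffles all 16 bytes, and a full-width store replaces a
// variable-length copy; the caller keeps only the first `count` bytes.
#[cfg(target_arch = "aarch64")]
fn compress_u8x16_neon_sketch(data: &[u8; 16], mask: u16, shuf: &[[u8; 16]], out: &mut [u8; 16]) -> usize {
    use core::arch::aarch64::*;
    unsafe {
        let v = vld1q_u8(data.as_ptr());
        let idx = vld1q_u8(shuf[mask as usize].as_ptr()); // precomputed byte indices for this mask
        let packed = vqtbl1q_u8(v, idx);                  // single TBL does the whole 16-byte shuffle
        vst1q_u8(out.as_mut_ptr(), packed);               // direct full-vector store
    }
    mask.count_ones() as usize
}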

API note: Destination buffers must have room for the full uncompressed vector (8/16 elements). This enables the fast path—the mask is unknown at compile time, so callers should always allocate worst-case.

General Performance Notes

  • AVX-512: Native compress instructions (VPCOMPRESSD, VPCOMPRESSB) are ~3-5× faster than shuffle-based fallback
  • NEON u32 shuffle: Uses TBL/TBL2 with byte-level indexing (converts u32 indices to byte offsets)
  • Bitmask expansion: Parallel vceq/vmovl chain replaces scalar loop
  • Lookup tables:
    • u32x8 compress indices: 256×8×4 = 8KB (fits in L1 cache)
    • u32x8 byte indices for NEON: 256×32 = 8KB (fits in L1 cache)
    • u8x16 compress indices for NEON: 65536×16 = 1MB (may cause cache pressure on hot paths)
  • SimdSplit: AVX-512 uses single extract instruction; fallback is zero-cost transmute

TODO list

  • Build proper SIMD extensions for memory prefetch, masked VGATHER, etc. that are reusable in different places. For example, build traits on top of wide's SIMD types and implement them for different architectures.
  • Refactor and get rid of all of the ugly AI-generated intrinsic code.
  • Build a good-looking SIMD bitvec core with no AI-generated intrinsics.
  • As we build the SIMD intrinsics and other lookup utilities, add plenty of RustDoc detailing the whys, the performance and space/memory characteristics, and other tradeoffs.