| Crates.io | simd-lookup |
| lib.rs | simd-lookup |
| version | 0.1.0 |
| created_at | 2025-12-18 23:34:38.370875+00 |
| updated_at | 2025-12-18 23:34:38.370875+00 |
| description | High-performance SIMD utilities for fast table lookups, compression and data processing |
| homepage | https://github.com/velvia/simd-lookup |
| repository | https://github.com/velvia/simd-lookup |
| max_upload_size | |
| id | 1993843 |
| size | 314,480 |
High-performance SIMD utilities for fast table lookups, compression and data processing in Rust.
Built on top of the wide crate, this crate automatically detects and uses the best available CPU features, with fallbacks for older CPUs. It is optimized for both ARM NEON (aarch64) and Intel AVX-512 (x86_64) architectures.
Note: Table64 is primarily optimized for ARM NEON using the TBL4 instruction, which provides
excellent performance on Apple Silicon and other ARMv8+ CPUs. On Intel x86_64, it requires newer AVX-512
features (Ice Lake+).
| Module/Feature | Required CPU Features | Available CPUs | Fallback |
|---|---|---|---|
| simd_compress (compress_store_u32x8) | AVX512F + AVX512VL (x86), NEON TBL2 (ARM) | Skylake-X+, Ice Lake+, All ARM | NEON TBL on ARM, shuffle table elsewhere |
| simd_compress (compress_store_u32x16) | AVX512F | Skylake-X+, Ice Lake+ | Two u32x8 compresses |
| simd_compress (compress_store_u8x16) | AVX512VBMI2 + AVX512VL (x86), NEON TBL (ARM) | Ice Lake+, Tiger Lake+, All ARM | NEON TBL on ARM, gather-style writes elsewhere |
| simd_gather (gather_u32index_u8) | AVX512F + AVX512BW | Skylake-X+, Ice Lake+ | Scalar loop |
| simd_gather (gather_u32index_u32) | AVX512F | Skylake-X+, Ice Lake+ | Scalar loop |
| Table64 | ARM NEON TBL4 (aarch64) or AVX512BW + AVX512VBMI (x86_64) | All ARMv8+ (Apple Silicon), Ice Lake+ | Scalar lookup (x86_64 only) |
| Table2dU8xU8 | AVX512F + AVX512BW | Skylake-X+, Ice Lake+ | Scalar lookup |
| Cascading Lookup Kernel | AVX512F + AVX512VL + AVX512BW + AVX512VBMI2 | Ice Lake+, Tiger Lake+ | Scalar lookup |
Compression (simd_compress module):
- compress_store_u32x8: requires AVX512F + AVX512VL
  - x86: VPCOMPRESSD instruction
- compress_store_u32x16: requires AVX512F
  - x86: VPCOMPRESSD instruction (512-bit variant)
  - Fallback: two compress_store_u32x8 operations
- compress_store_u8x16: requires AVX512VBMI2 + AVX512VL
  - x86: VPCOMPRESSB instruction
  - ARM: NEON TBL (vqtbl1q_u8) with precomputed shuffle indices

Gather (simd_gather module):
- gather_u32index_u8: requires AVX512F + AVX512BW
  - x86: VGATHERDPS + VPMOVDB instructions
- gather_u32index_u32: requires AVX512F
  - x86: VGATHERDPS instruction

Small tables (small_table module):
- Table64: highly optimized for ARM NEON (primary optimization target)
  - ARM: TBL4 instruction (vqtbl4q_u8)
  - x86: VPERMB instruction (_mm512_permutexvar_epi8) for 64-byte table lookups
- Table2dU8xU8: requires AVX512F + AVX512BW (via simd_gather)
  - x86: VGATHERDPS + VPMOVDB for parallel lookups

Lookup kernels (lookup_kernel module):
- SimdCascadingTableU32U8Lookup: requires AVX512F + AVX512VL + AVX512BW + AVX512VBMI2
  - Combines compress_store_u8x16, compress_store_u32x16, and gather_u32index_u8

You can check which features your CPU supports:
# Linux
grep flags /proc/cpuinfo | head -1
# Or use Rust's feature detection
cargo run --example check_features
All functions automatically detect available CPU features at runtime and use the best available implementation.
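A minimal standalone sketch of that runtime check using the standard library's detection macros (this is not the crate's check_features example, just an illustration):
fn main() {
    #[cfg(target_arch = "x86_64")]
    {
        // The AVX-512 subsets referenced in the table above.
        println!("avx512f:     {}", is_x86_feature_detected!("avx512f"));
        println!("avx512vl:    {}", is_x86_feature_detected!("avx512vl"));
        println!("avx512bw:    {}", is_x86_feature_detected!("avx512bw"));
        println!("avx512vbmi:  {}", is_x86_feature_detected!("avx512vbmi"));
        println!("avx512vbmi2: {}", is_x86_feature_detected!("avx512vbmi2"));
    }
    #[cfg(target_arch = "aarch64")]
    {
        // NEON is baseline on ARMv8, but the check is still available.
        println!("neon: {}", std::arch::is_aarch64_feature_detected!("neon"));
    }
}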
SIMD wide-vector utilities (wide_utils module): this crate provides a rich set of SIMD utilities built on top of the wide crate, with optimized implementations for x86_64 (AVX-512/AVX2) and aarch64 (NEON).
Compress store (simd_compress module): stream compaction similar to AVX-512's VCOMPRESS instruction, packing selected elements contiguously based on a bitmask.
🚀 Highly optimized for ARM NEON — achieves up to 12 Gelem/s on Apple Silicon!
use simd_lookup::{compress_store_u32x8, compress_store_u32x16, compress_store_u8x16};
use wide::{u32x8, u32x16, u8x16};
// Compress u32x8: select elements where mask bits are set
let data = u32x8::from([10, 20, 30, 40, 50, 60, 70, 80]);
let mask = 0b10110010u8; // Select positions 1, 4, 5, 7
let mut output = [0u32; 8]; // Must have room for full vector!
let count = compress_store_u32x8(data, mask, &mut output);
// count == 4, output[0..4] == [20, 50, 60, 80]
// Also available for u32x16 (512-bit) and u8x16
Note: Destination buffer must have room for the full uncompressed vector (8/16 elements). This enables fast direct NEON stores instead of variable-length copies.
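A minimal sketch of how a caller might compact a longer slice in 8-element chunks, using only the compress_store_u32x8 call shown above; the keep-even predicate and the chunking/remainder handling are illustrative, not part of the crate:
use simd_lookup::compress_store_u32x8;
use wide::u32x8;
fn keep_even(input: &[u32], out: &mut Vec<u32>) {
    let mut scratch = [0u32; 8]; // full-vector scratch, as the note above requires
    for chunk in input.chunks_exact(8) {
        let lanes: [u32; 8] = chunk.try_into().unwrap();
        // Build the selection bitmask on the scalar side (bit i set => keep lane i).
        let mut mask = 0u8;
        for (i, &v) in lanes.iter().enumerate() {
            if v % 2 == 0 {
                mask |= 1u8 << i;
            }
        }
        let kept = compress_store_u32x8(u32x8::from(lanes), mask, &mut scratch) as usize;
        out.extend_from_slice(&scratch[..kept]);
    }
    // Tail shorter than one vector: handled with plain scalar code.
    for &v in input.chunks_exact(8).remainder() {
        if v % 2 == 0 {
            out.push(v);
        }
    }
}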
| Function | AVX-512 | ARM NEON | Throughput (ARM) |
|---|---|---|---|
| compress_store_u32x8 | VPCOMPRESSD | TBL2 + direct store | ~4.3 Gelem/s |
| compress_store_u32x16 | VPCOMPRESSD | 2× NEON u32x8 | ~5.3 Gelem/s |
| compress_store_u8x16 | VPCOMPRESSB | TBL + direct store | ~12 Gelem/s |
Variable-index shuffle using the same SIMD type for indices (zero-copy from lookup tables):
use simd_lookup::WideUtilsExt;
use wide::u32x8;
let data = u32x8::from([10, 20, 30, 40, 50, 60, 70, 80]);
let indices = u32x8::from([7, 6, 5, 4, 3, 2, 1, 0]); // Reverse
let reversed = data.shuffle(indices);
// reversed == [80, 70, 60, 50, 40, 30, 20, 10]
| Type | AVX2 | NEON | Scalar |
|---|---|---|---|
| u32x8 | VPERMD | TBL2 (byte-level) | Loop |
| u32x4 | — | TBL (byte-level) | Loop |
| u8x16 | PSHUFB | TBL | Loop |
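Because the index vector has the same SIMD type as the data, shuffle also serves as an 8-entry in-register lookup table; a small sketch under that reading (the squares table and key values below are made up for illustration):
use simd_lookup::WideUtilsExt;
use wide::u32x8;
// table[i] = i*i, kept entirely in one register
let squares = u32x8::from([0, 1, 4, 9, 16, 25, 36, 49]);
let keys = u32x8::from([3, 3, 7, 0, 2, 5, 1, 6]);
let looked_up = squares.shuffle(keys);
// looked_up == [9, 9, 49, 0, 4, 25, 1, 36]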
Vector splitting (SimdSplit trait): efficiently extract the high and low halves of wide vectors:
use simd_lookup::SimdSplit;
use wide::u32x16;
let data = u32x16::from([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]);
let (lo, hi) = data.split_low_high();
// lo: u32x8 = [1,2,3,4,5,6,7,8]
// hi: u32x8 = [9,10,11,12,13,14,15,16]
// Or extract just one half
let low_half = data.low_half();
let high_half = data.high_half();
| Type | AVX-512 | Fallback |
|---|---|---|
| u32x16 → u32x8 | _mm512_extracti64x4_epi64 | Array slicing |
| u64x8 → u64x4 | _mm512_extracti64x4_epi64 | Array slicing |
Zero-extend smaller types to larger types:
use simd_lookup::WideUtilsExt;
use wide::{u32x8, u64x8};
let input = u32x8::from([1, 2, 3, 4, 5, 6, 7, 8]);
let widened: u64x8 = input.widen_to_u64x8();
// widened == [1u64, 2, 3, 4, 5, 6, 7, 8]
| Type | AVX-512 | AVX2 | NEON |
|---|---|---|---|
| u32x8 → u64x8 | VPMOVZXDQ | 2× VPMOVZXDQ | VMOVL |
| u32x4 → u64x4 | — | VPMOVZXDQ | VMOVL |
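The splitting and widening helpers compose naturally; a short combined sketch that uses only the split_low_high and widen_to_u64x8 methods shown above to promote a full u32x16 to two u64x8 vectors:
use simd_lookup::{SimdSplit, WideUtilsExt};
use wide::{u32x16, u64x8};
let data = u32x16::from([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]);
// Split into two u32x8 halves, then zero-extend each half to u64x8.
let (lo, hi) = data.split_low_high();
let lo64: u64x8 = lo.widen_to_u64x8();
let hi64: u64x8 = hi.widen_to_u64x8();
// lo64 == [1u64, 2, 3, 4, 5, 6, 7, 8], hi64 == [9u64, 10, 11, 12, 13, 14, 15, 16]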
Convert a scalar bitmask to a SIMD mask vector:
use simd_lookup::FromBitmask;
use wide::u64x8;
let mask = 0b10101010u8;
let mask_vec: u64x8 = u64x8::from_bitmask(mask);
// mask_vec == [0, MAX, 0, MAX, 0, MAX, 0, MAX]
| Type | AVX-512 | ARM NEON | AVX2/Other |
|---|---|---|---|
| u64x8 | VPBROADCASTQ + mask | VCEQ + VMOVL chain | Loop |
| u32x8 | VPBROADCASTD + mask | VCEQ + VMOVL chain | Loop |
Doubling (double() method on WideUtilsExt): efficiently double each element via self + self. Addition is well supported on all architectures (NEON vaddq, SSE paddb), making it the most efficient way to multiply by powers of 2:
use simd_lookup::WideUtilsExt;
use wide::u8x16;
let a = u8x16::from([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]);
// x * 2
let doubled = a.double();
// doubled == [2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]
// x * 8 (chain three doubles)
let times_8 = a.double().double().double();
// times_8 == [8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128]
This is more efficient than scalar multiplication for types like u8x16 where x86 lacks native byte multiply/shift instructions.
Pre-computed shuffle indices for compress operations (256 entries for 8-element masks):
use simd_lookup::{SHUFFLE_COMPRESS_IDX_U32X8, get_compress_indices_u32x8};
// Raw array access
let indices: [u32; 8] = SHUFFLE_COMPRESS_IDX_U32X8[0b10110010];
// indices == [1, 4, 5, 7, 7, 7, 7, 7] (unused positions filled with 7)
// Zero-cost SIMD access via transmute
let simd_indices = get_compress_indices_u32x8(0b10110010u8);
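These indices compose with the shuffle extension above to express the non-AVX-512 compress path; the sketch below assumes get_compress_indices_u32x8 returns a u32x8 index vector, as the "zero-cost SIMD access" comment suggests:
use simd_lookup::{get_compress_indices_u32x8, WideUtilsExt};
use wide::u32x8;
let data = u32x8::from([10, 20, 30, 40, 50, 60, 70, 80]);
let mask = 0b10110010u8; // keep lanes 1, 4, 5, 7
// Move the kept lanes to the front using the precomputed index vector.
let packed = data.shuffle(get_compress_indices_u32x8(mask));
// packed == [20, 50, 60, 80, 80, 80, 80, 80] (tail lanes repeat index 7)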
small_table — Small Table SIMD Lookup: a 64-entry lookup table primarily optimized for ARM NEON TBL4 (excellent performance on Apple Silicon), with AVX-512 VPERMB support on Intel Ice Lake+. Useful for fast pattern detection and small dictionary lookups.
prefetch — SIMD Memory Prefetch: cross-platform memory prefetch utilities, including masked prefetch for 8 addresses at once. Supports L1/L2/L3 cache hints.
bulk_vec_extender — Efficient Vec Extension: utilities for efficiently extending a Vec with SIMD-produced results, minimizing bounds checks and reallocations.
entropy_map_lookup — Entropy-Optimized Lookups: lookup structures optimized for low-entropy data (few unique values), using bitpacking and small lookup tables.
eight_value_lookup — 8-Value Fast Path: specialized lookup for tables with ≤8 unique values, using SIMD comparison and bitmask extraction.
The NEON compress operations achieve exceptional throughput through optimized direct vector stores:
| Operation | Throughput | vs Scalar |
|---|---|---|
| compress_store_u8x16 | ~12 Gelem/s | ~8× faster |
| compress_store_u32x8 | ~4.3 Gelem/s | ~3-4× faster |
| compress_store_u32x16 | ~5.3 Gelem/s | ~5-6× faster |
Key optimizations:
- Direct vector stores: vst1q_u8 writes the full vector instead of doing variable-length copies
- Single shuffle: compress_store_u8x16 uses one vqtbl1q_u8 for the 16-byte shuffle
- API note: destination buffers must have room for the full uncompressed vector (8/16 elements). This enables the fast path: the mask is unknown at compile time, so callers should always allocate for the worst case.
- AVX-512: native compress instructions (VPCOMPRESSD, VPCOMPRESSB) are ~3-5× faster than the shuffle-based fallback
- NEON shuffle: TBL/TBL2 with byte-level indexing (converts u32 indices to byte offsets)
- NEON from_bitmask: a vceq/vmovl chain replaces the scalar loop
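As an illustration of the "single TBL plus full-width store" idea, here is a simplified sketch; it is not the crate's implementation, which uses precomputed index tables instead of the scalar index build shown here:
#[cfg(target_arch = "aarch64")]
unsafe fn compress16_sketch(src: &[u8; 16], mask: u16, dst: &mut [u8; 16]) -> usize {
    use core::arch::aarch64::{vld1q_u8, vqtbl1q_u8, vst1q_u8};
    // Gather the positions of set mask bits to the front of an index vector
    // (the real crate precomputes these shuffle indices).
    let mut idx = [0u8; 16];
    let mut kept = 0usize;
    for lane in 0..16u8 {
        if (mask >> lane) & 1 != 0 {
            idx[kept] = lane;
            kept += 1;
        }
    }
    // One TBL shuffles the kept bytes to the front of the vector ...
    let shuffled = vqtbl1q_u8(vld1q_u8(src.as_ptr()), vld1q_u8(idx.as_ptr()));
    // ... and one full-width store writes all 16 bytes. Only the first `kept` bytes
    // are meaningful, which is why the destination must fit the whole vector.
    vst1q_u8(dst.as_mut_ptr(), shuffled);
    kept
}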