typed-ski Rust (WASM arena)
This directory contains the Rust implementation of the arena allocator and
reducer used by typed-ski's WASM evaluators.
Overview
This Rust code targets wasm32-unknown-unknown and runs inside WebAssembly.
It provides:
- Hash-consing arena allocation with structural sharing for memory
efficiency
- Iterative reduction algorithm to prevent stack overflow on deep
expressions
- Preemptive multitasking with cooperative yielding for fair worker
scheduling
- Lock-free ring buffers (io_uring-style) for efficient inter-thread
communication
- SharedArrayBuffer
support for true parallel evaluation across Web Workers
- Dynamic arena growth with seqlock-based synchronization
Architecture
The arena allocator supports two modes:
- Single-threaded (Heap Mode): Uses local WASM memory with lazy allocation
- Multi-threaded (SAB Mode): Uses
WebAssembly shared memory (SharedArrayBuffer) to enable
concurrent access from multiple Web Workers, with atomic operations and
lock-free synchronization for thread safety
Hash consing is implemented using a hash table with chaining, ensuring that
expressions with the same structure (same left and right children) are
automatically deduplicated. The arena can grow dynamically up to a fixed maximum
capacity.
Thread Safety and Locking
In SAB (SharedArrayBuffer) mode, the arena uses lock-free and atomic-based
concurrency control to enable high-performance concurrent access:
Optimistic Concurrency Control
Hash-consing allocation uses optimistic concurrency control:
- Optimistic read: Traverse bucket chain without locking
- Pre-allocate ID: Use
top.fetch_add() to reserve slot atomically
- CAS insertion: Attempt to insert into hash table bucket
- Graceful failure: If CAS fails, slot becomes a "hole" (
kind = 0)
Correctness guarantee: Canonical node IDs are always returned. Readers treat
non-NonTerm kinds as terminal values, so holes don't affect evaluation.
Lock-Free Ring Buffers
Inter-thread communication uses lock-free ring buffers (io_uring-style):
- Submission Queue (SQ): Main thread → Worker communication
- Completion Queue (CQ): Worker → Main thread results
- Atomic operations: Wait-free producer/consumer patterns
- Sequence numbers: ABA prevention in concurrent access
- Blocking waits: Efficient WASM atomic wait/notify
Resize Synchronization
Arena growth uses a seqlock-style approach:
- Stop-the-world pauses: All threads spin during resize operations
- Sequence lock: Odd values indicate resize in progress, even values
indicate stable
- Reverse-order copying: Prevents overlap issues during memory migration
- Bucket rebuild: Hash table reconstructed after resize
- Poisoning on failure: Unrecoverable errors set poison sequences
Key Optimizations
1. Hash-Consing (Structural Sharing)
The arena uses hash consing to
deduplicate identical subexpressions:
- Avalanche hashing: Fast, high-quality hash function for node pairs
- Separate chaining:
Collision resolution in hash table buckets
- Memory efficiency:
DAG representation
instead of tree (common subexpressions shared)
- O(1) lookups: Fast deduplication during expression construction
2. Iterative Reduction Algorithm
Avoids recursion stack overflow on deep expressions:
- Explicit stack management: Uses Continuation nodes instead of call stack
- Two-phase reduction: Left child first, then right child
- Memory recycling: Dead continuation frames reused immediately
- Scalable depth: No theoretical limit on expression depth
3. Preemptive Multitasking
Enables fair scheduling in parallel evaluation:
- Gas-based yielding: Workers yield control when traversal budget exhausted
- Suspension nodes: Capture complete evaluation state for resumption
- Cooperative scheduling: Prevents worker starvation
- State serialization: Evaluation context stored directly in arena
4. Lock-Free Ring Buffers
High-performance inter-thread communication:
Arena Node Types
The arena supports four node types for different purposes:
Terminal: SKI combinators (S, K, I) - leaf nodes
NonTerm: Function application (left right) - core expressions
Continuation: Stack frames for iterative reduction - optimization
Suspension: Paused evaluation state for multitasking - optimization
Execution Model (Host ↔ Workers)
- The host process (JS/TS) is responsible for building expressions in the
arena (via
allocTerminal/allocCons).
- The host then submits work to workers using the SQ/CQ rings.
- Workers run an exported infinite loop (
workerLoop) that:
- blocks waiting for SQ entries (
memory.atomic.wait32)
- reduces the submitted root using iterative algorithm with gas limits
- yields control via Suspensions when gas exhausted (preemptive multitasking)
- enqueues a CQ completion when reduction finishes or yields
Correlation IDs
Work is correlated using a 32-bit req_id chosen by the host.
- SQ entry:
{ node_id: u32, req_id: u32 }
- CQ entry:
{ node_id: u32, req_id: u32 }
Completions may arrive out-of-order; req_id is how the host matches results to
callers.
Thread Safety & Concurrency
Hash-Consing Under Concurrency
allocCons implements optimistic concurrency control:
- Optimistic read: Traverse bucket chain without locking
- Pre-allocate ID: Use
top.fetch_add() to reserve slot
- CAS insertion: Attempt
to insert into hash table bucket
- Graceful failure: If CAS fails, slot becomes a "hole" (
kind = 0)
Correctness guarantee: Canonical node IDs are always returned. Readers treat
non-NonTerm kinds as terminal values, so holes don't affect evaluation.
Stop-the-World Resizing
Uses seqlock-style synchronization for
arena growth:
- Odd sequence: Resize in progress (writers spin, readers retry)
- Array migration: Reverse-order copying prevents overlap issues
- Bucket rebuild: Hash table reconstructed after resize
- Even sequence: Stable state (normal operation resumes)
Ring Buffer Concurrency
- Single producer/consumer: Each ring has dedicated thread roles
- ABA prevention: Sequence
numbers prevent concurrent access issues
- Memory barriers: Proper
acquire/release ordering for visibility
- Wait/notify:
Efficient blocking with WASM atomic operations
Shared Memory Safety
- Typed array views: JS-side access to arena arrays
- Cache coherence: 64-byte alignment prevents false sharing
- Poisoning on failure: Unrecoverable errors set poison sequences
- Cross-thread visibility: SharedArrayBuffer enables true parallelism
Performance Characteristics
Time Complexity
- Allocation: O(1) amortized (hash table lookup + potential growth)
- Hash-consing lookup: O(1) average case (hash table with chaining)
- Reduction step: O(depth) with iterative algorithm (no stack overflow)
- Ring operations: O(1) (lock-free with atomic CAS)
Space Complexity
- Per node: ~16 bytes (4 arrays × 4 bytes + metadata)
- Hash table: O(capacity) for buckets and collision chains
- Structural sharing: Significant savings for common subexpressions
- Memory efficiency: ~60% less memory than naive tree representation
Scalability
- Expression depth: Unlimited (iterative vs recursive)
- Parallel workers: Linear scaling with CPU cores
- Memory growth: Dynamic resizing with seqlock synchronization
- Concurrent allocations: Lock-free with optimistic concurrency
Memory Layout
The shared arena is laid out as header + fixed rings + SoA arrays.
Header (SabHeader)
- Magic number, ring parameters, memory offsets
- Capacity, bucket masks, atomic counters
max_steps, resize_seq, top for coordination
Rings (64-byte aligned)
- SQ (Submission Queue): Host → Worker communication
- CQ (Completion Queue): Worker → Host results
- Lock-free with sequence numbers for ABA prevention
SoA Arrays (Struct of Arrays)
kind: u8[] - Node type (Terminal/NonTerm/Continuation/Suspension)
sym: u8[] - Symbol/mode data
left_id: u32[] - Left child pointers
- `right_id: u32[] - Right child pointers
hash32: u32[] - Hash values for deduplication
next_idx: u32[] - Hash table collision chains
buckets: u32[] - Hash table bucket heads
- Terminal cache: Fast access to S/K/I nodes
The JS/TS side constructs typed array views over these arrays for fast read-only
decoding and direct memory access.
Building
The build is managed by Nix. Cargo.toml is generated by the flake.
nix run .#generate-cargo
cargo build --target wasm32-unknown-unknown --release
WASM Exports
Arena / reducer
initArena(capacity: u32) -> u32: allocate and initialize a shared arena
region (returns base pointer or error code).
connectArena(ptr: u32) -> u32: connect an instance to an existing arena
(returns 1 on success).
reset()
allocTerminal(sym: u32) -> u32
allocCons(left: u32, right: u32) -> u32
arenaKernelStep(expr: u32) -> u32
reduce(expr: u32, max: u32) -> u32
SQ/CQ interface
hostSubmit(node_id: u32, req_id: u32, max_steps: u32) -> u32
- returns
0 ok, 1 SQ full, 2 not connected
max_steps is the reduction limit for this specific request (eliminates
race conditions)
hostPull() -> i64
- returns
-1 if CQ empty, otherwise a packed value:
- high 32 bits:
req_id
- low 32 bits:
node_id (result root)
workerLoop() -> !: blocking loop for worker instances (never returns).
Debug
debugLockState() -> u32: returns resize_seq & 1 (1 if resizing).
getArenaMode() -> u32: 0 heap mode, 1 SAB mode.
debugGetArenaBaseAddr() -> u32
debugCalculateArenaSize(capacity: u32) -> u32
WASM Imports
memory: WebAssembly.Memory (provided via env.memory) - The shared memory
instance (must be created with shared: true for SAB mode)
js_allow_block() -> i32 - Returns 0 for main thread (spin-only) or 1 for
worker threads (can block). This is instance-local and not stored in shared
memory to avoid state collisions.
Further Reading
For deeper understanding of the techniques used:
Structure
src/lib.rs: crate entry point
src/arena.rs: arena allocator + rings + reducer