ruvector-mincut-gated-transformer

Crates.ioruvector-mincut-gated-transformer
lib.rsruvector-mincut-gated-transformer
version0.1.0
created_at2025-12-29 19:13:00.284285+00
updated_at2025-12-29 19:13:00.284285+00
descriptionUltra low latency transformer inference with mincut-gated coherence control
homepage
repositoryhttps://github.com/ruvnet/ruvector
max_upload_size
id2011127
size772,633
rUv (ruvnet)

documentation

README

Mincut-Gated Transformer

Ultra-low latency transformer inference with graph-theoretic coherence control, designed for real-time AI systems and edge deployment

Crates.io Documentation License

Introduction

The Mincut-Gated Transformer is a production-grade inference engine that combines minimum cut (mincut) graph partitioning with adaptive compute allocation to achieve deterministic, ultra-low latency inference. Unlike traditional transformers that execute all layers uniformly, this architecture uses graph-theoretic coherence signals to dynamically skip computation, exit early, and control state updates—all while maintaining explainability and safety guarantees.

Why Mincut? The minimum cut value (λ) of an attention graph provides a principled measure of information flow coherence. When λ is high and stable, the model can safely reduce computation. When λ drops or becomes unstable, the system conservatively executes more layers. This creates a natural feedback loop between model confidence and compute allocation.

Key Innovations

Innovation Technique Benefit
λ-based Mixture-of-Depths Route tokens using mincut delta instead of learned routers 50% FLOPs reduction
Coherence-driven Early Exit Exit when λ stabilizes across layers 30-50% latency reduction
Mincut Sparse Attention Use partition boundaries for sparse masks 90% attention FLOPs reduction
Energy-based Gating Treat coherence as energy function Principled compute-quality tradeoffs
Spike-driven Scheduling Event-driven inference on activity 87× energy efficiency
Spectral Position Encoding Graph Laplacian eigenvectors via Lanczos O(n) structural awareness
EAGLE-3 Speculative Decoding λ-guided draft tree verification 3-5× decoding speedup
Mamba SSM Hybrid Selective state spaces with O(n) complexity Linear-time sequence modeling
FlashAttention Tiling Block-wise attention with online softmax O(n) memory, 2-4× faster
KV Cache INT4 Hadamard transform + 2/4-bit quantization 8-16× cache compression
RoPE with NTK/YaRN Context extension beyond training length 4-32× context scaling

Features

Core Capabilities

  • Deterministic inference — Same inputs always produce identical outputs (bit-exact)
  • Bounded latency — Predictable p99 guarantees through tier-based execution
  • Explainable decisions — Every inference produces a witness explaining all interventions
  • Allocation-free hot path — Zero heap allocations during inference after initialization
  • Safety controls — Coherence-gated state updates prevent contamination propagation

Quantization & Memory

  • INT8 quantization — Full model quantization with per-tensor and per-row scaling
  • INT4 quantization — 2× memory reduction with per-row and block-wise scaling
  • Arena allocator — Single contiguous allocation for weights, 64-byte cache-aligned
  • Sparse CSR matrices — Efficient storage for spectral graph operations

SIMD Acceleration

  • AVX2/FMA (x86_64) — Vectorized GEMM, GELU, quantization with 8×32 tiling
  • NEON (aarch64) — ARM SIMD for mobile and edge devices
  • Scalar fallback — Portable implementation for all platforms

Advanced Features

  • Lanczos algorithm — O(n) eigenvalue computation for spectral position encoding
  • Power iteration — Fast dominant eigenvector extraction
  • Prefetch hints — Memory access optimization for sequential patterns
  • Benchmark utilities — Built-in profiling with GFLOPS and bandwidth metrics

SOTA 2025 Features

  • KV Cache INT4 (RotateKV) — Hadamard transforms for outlier smoothing, 2-bit/4-bit quantization with <0.3 PPL degradation
  • RoPE Embeddings — Rotary position encoding with NTK-aware and YaRN scaling for 4-32× context extension
  • EAGLE-3 Speculative Decoding — λ-guided draft tree generation with rejection sampling for 3-5× faster decoding
  • FlashAttention Tiling — Block-wise computation with online softmax, O(n) memory instead of O(n²)
  • Mamba SSM Layer — Selective state space models with O(n) complexity and O(1) inference memory per step
  • Criterion Benchmarks — Comprehensive kernel performance profiling with GFLOPS metrics

Quick Start

use ruvector_mincut_gated_transformer::prelude::*;

// Create configuration
let config = TransformerConfig::micro();
let policy = GatePolicy::default();

// Load weights (or use empty for testing)
let weights = QuantizedWeights::empty(&config);

// Create transformer
let mut transformer = MincutGatedTransformer::new(config, policy, weights)?;

// Create gate packet from mincut signals
let gate = GatePacket {
    lambda: 100,              // Minimum cut value
    lambda_prev: 95,          // Previous lambda for delta computation
    boundary_edges: 5,        // Cross-partition edge count
    boundary_concentration_q15: 8192,  // ~25% concentration (Q15 format)
    partition_count: 3,       // Number of detected partitions
    flags: 0,
};

// Prepare input
let input = InferInput::from_tokens(&[1, 2, 3, 4], gate);

// Allocate output buffer
let mut logits = vec![0i32; config.logits as usize];
let mut output = InferOutput::new(&mut logits);

// Run inference
transformer.infer(&input, &mut output)?;

// Check witness for gate decisions
println!("Decision: {:?}", output.witness.decision);
println!("Reason: {:?}", output.witness.reason);
println!("External writes allowed: {}", output.witness.external_writes_enabled);

Architecture Overview

                    ┌─────────────────┐
                    │   Gate Packet   │
                    │  (λ, Δλ, edges) │
                    └────────┬────────┘
                             │
    Input ──────────────────►│
                             ▼
                    ┌─────────────────┐
                    │ Spike Scheduler │──── Skip (tier 3)
                    │  Event-driven   │
                    └────────┬────────┘
                             │
                             ▼
                    ┌─────────────────┐
                    │ Gate Controller │──── Select tier 0/1/2
                    │ Coherence-gated │
                    └────────┬────────┘
                             │
                             ▼
              ┌──────────────┴──────────────┐
              │      Transformer Core       │
              │  ┌────────────────────────┐ │
              │  │ MoD Router (λ-based)   │ │
              │  └───────────┬────────────┘ │
              │              ▼              │
              │  ┌────────────────────────┐ │
              │  │ Sparse Attention       │ │
              │  │ (mincut boundaries)    │ │
              │  └───────────┬────────────┘ │
              │              ▼              │
              │  ┌────────────────────────┐ │
              │  │ Early Exit Check       │ │──── Exit if λ stable
              │  │ (coherence threshold)  │ │
              │  └───────────┬────────────┘ │
              └──────────────┴──────────────┘
                             │
                             ▼
                    ┌─────────────────┐
                    │ Output + Witness│
                    │  (explainable)  │
                    └─────────────────┘

Tier System

Tier Layers Seq Len Window Use Case Speedup
0 4 64 16 Normal (high λ)
1 2 32 8 Reduced (moderate λ) 2-3×
2 1 8 4 Safe mode (low λ) 5-10×
3 0 0 0 Skip (no spike) 50-200×

Performance

Expected Speedups

Workload Type Skip Rate Speedup Memory Reduction
Streaming (low activity) 70% 10-15× 80%
Interactive (bursty) 40% 4-6× 50%
Continuous (high throughput) 10% 2-3× 40%
Safety-critical (conservative) 5% 1.5-2× 25%

SIMD Performance (on x86_64 AVX2)

Operation Scalar SIMD Speedup
INT8 GEMM (256×256) 12ms 1.8ms 6.7×
GELU activation (1024) 45µs 8µs 5.6×
Quantize f32→i8 (1024) 38µs 7µs 5.4×

Memory Footprint

Model Config INT8 INT4 Arena Overhead
Micro (2L, 128H) 1.2 MB 0.6 MB +64 bytes
Baseline (4L, 256H) 8.5 MB 4.3 MB +64 bytes
Medium (12L, 768H) ~85 MB ~43 MB +64 bytes

Configuration

Preset Configurations

// Micro: WASM, edge gateways, embedded
let config = TransformerConfig::micro();
// Seq: 32, Hidden: 128, Heads: 4, Layers: 2

// Baseline: CPU inference, development
let config = TransformerConfig::baseline();
// Seq: 64, Hidden: 256, Heads: 4, Layers: 4

Gate Policy

let policy = GatePolicy {
    lambda_min: 30,                         // Minimum coherence threshold
    drop_ratio_q15_max: 16384,              // Max λ drop (50% in Q15)
    boundary_edges_max: 20,                 // Max cross-partition edges
    boundary_concentration_q15_max: 24576,  // Max concentration (75%)
    partitions_max: 8,                      // Max partition count
    spike_rate_q15_max: 26214,              // Max spike rate (80%)
    allow_kv_write_when_unstable: false,    // Freeze KV cache
    allow_external_write_when_unstable: false, // Block external writes
};

Feature Flags

Core Features

  • sliding_window (default) — Sliding window attention
  • linear_attention — Linear attention for O(n) scaling

Quantization

  • simd — AVX2/NEON SIMD acceleration
  • int4 — INT4 quantization support
  • fixed_point_softmax — Fixed-point for embedded targets
  • rmsnorm — RMSNorm instead of LayerNorm

Advanced

  • spectral_pe — Spectral position encoding with Lanczos
  • sparse_attention — Mincut-guided sparse attention
  • energy_gate — Energy-based gate decisions
  • spike_attention — Spike-driven attention mechanism
  • trace — Runtime tracing and snapshots

Platform

  • wasm — WebAssembly support
  • no_std_gateway — No-std for embedded gateways

Current Limitations

Feature Status Notes
GPU inference Not implemented CUDA/Metal kernels needed
KV cache persistence Implemented INT4 with Hadamard transforms
Multi-head grouped query Not implemented GQA for memory efficiency
Flash Attention Implemented CPU tiled with online softmax
Rotary position embeddings Implemented RoPE with NTK/YaRN scaling
Criterion benchmarks Implemented Kernel, gate, latency benchmarks
GGML/GGUF format Not implemented Model format compatibility
Batched inference Partial Single-sequence optimized
Async/streaming output Not implemented Token-by-token streaming
Mamba/SSM hybrid Implemented Selective state space layer
Speculative decoding Implemented EAGLE-3 style with λ-guidance

Academic Foundations

This implementation integrates peer-reviewed research:

Core Architecture

  1. Mixture-of-Depths (Raposo et al., 2024) — Dynamic compute allocation
  2. LayerSkip (Elhoushi et al., 2024) — Early exit and self-speculative decoding
  3. MInference (Jiang et al., 2024) — Dynamic sparse attention
  4. Energy-Based Transformers (Gladstone et al., 2025) — Energy-based decisions
  5. Spike-driven Transformer (Yao et al., 2023, 2024) — Event-driven inference
  6. Spectral Attention (Kreuzer et al., 2021) — Graph-based position encoding

SOTA 2025 Research

  1. RotateKV (IJCAI 2025) — Hadamard transforms for KV cache quantization
  2. EAGLE-3 (NeurIPS 2025) — Speculative decoding with draft tree verification
  3. FlashAttention-3 (Dao et al., 2024) — IO-aware attention with online softmax
  4. Mamba (Gu & Dao, 2023) — Selective State Space Models
  5. Mamba-2 (Dao & Gu, 2024) — Structured state space duality
  6. RoFormer (Su et al., 2021) — Rotary position embeddings
  7. YaRN (Peng et al., 2023) — Efficient context window extension
  8. NTK-Aware Scaling (bloc97, 2023) — Base frequency adjustment for context extension

See docs/THEORY.md for detailed theoretical foundations.

Integration

With RuVector Mincut

use ruvector_mincut_gated_transformer::prelude::*;
use ruvector_mincut::MincutEngine;

// Compute mincut from attention graph
let mut mincut = MincutEngine::new(num_nodes);
// ... add edges from attention weights ...
let lambda = mincut.compute_mincut();

// Create gate packet
let gate = GatePacket {
    lambda,
    lambda_prev: prev_lambda,
    boundary_edges: mincut.boundary_edge_count(),
    ..Default::default()
};

// Run gated inference
transformer.infer(&InferInput::from_tokens(tokens, gate), &mut output)?;

Arena Allocator

use ruvector_mincut_gated_transformer::arena::{WeightArena, calculate_arena_size};

// Calculate total size for model
let size = calculate_arena_size(layers, hidden, ffn_mult, heads);
let mut arena = WeightArena::new(size);

// Allocate weight slices
let w_q = arena.alloc_i8(hidden * hidden).unwrap();
let scales = arena.alloc_f32(hidden).unwrap();

INT4 Quantization

use ruvector_mincut_gated_transformer::kernel::quant4::{Int4Weights, int4_gemv};

// Create INT4 weights from f32 (50% memory savings)
let int4_w = Int4Weights::from_f32(&weights, rows, cols);

// Matrix-vector multiplication
int4_gemv(&int4_w, &input, 1.0, &mut output);

KV Cache INT4 (RotateKV)

use ruvector_mincut_gated_transformer::kv_cache::{QuantizedKVCache, QuantBits};

// Create 2-bit quantized KV cache (16× compression)
let mut cache = QuantizedKVCache::new(
    num_layers,
    num_heads,
    head_dim,
    max_seq_len,
    QuantBits::Two,
);

// Store key/value with automatic Hadamard transform
cache.store_key(layer, head, position, &key_vector);
cache.store_value(layer, head, position, &value_vector);

// Retrieve (dequantize + inverse Hadamard)
let key = cache.get_key(layer, head, position);

RoPE Embeddings

use ruvector_mincut_gated_transformer::rope::{RopeConfig, RopeEmbedding, RopeScaling};

// Standard RoPE
let config = RopeConfig::default();
let rope = RopeEmbedding::new(&config)?;

// NTK-aware scaling for 4× context extension
let config = RopeConfig {
    scaling_type: RopeScaling::NTKAware { alpha: 4.0 },
    ..Default::default()
};

// Apply to Q/K vectors
rope.apply(&mut q, &mut k, position);

FlashAttention Tiling

use ruvector_mincut_gated_transformer::flash_attention::{
    FlashAttentionConfig, flash_attention_forward,
};

let config = FlashAttentionConfig {
    block_size_q: 64,
    block_size_kv: 64,
    head_dim: 64,
    causal: true,
    softmax_scale: 0.125,
};

// O(n) memory attention
flash_attention_forward(&config, &q, &k, &v, seq_len, seq_len, &mut output);

Mamba SSM Layer

use ruvector_mincut_gated_transformer::mamba::{MambaConfig, MambaLayer};

let config = MambaConfig::default();
let mut layer = MambaLayer::new(config);

// Recurrent mode (O(1) memory per step)
for token in tokens.iter() {
    let output = layer.step_recurrent(token);
}

// Batch mode for training
let outputs = layer.forward_sequence(&input_sequence);

EAGLE-3 Speculative Decoding

use ruvector_mincut_gated_transformer::speculative::{
    SpeculativeConfig, SpeculativeDecoder,
};

let config = SpeculativeConfig {
    max_draft_tokens: 8,
    tree_width: 4,
    acceptance_threshold: 0.9,
    lambda_guidance: true,  // Use mincut λ for tree construction
};

let mut decoder = SpeculativeDecoder::new(config, &gate_policy);

// Generate with speculation (3-5× faster)
let (tokens, stats) = decoder.generate_with_speculation(
    &draft_model,
    &target_model,
    &prompt,
    max_new_tokens,
);

Safety & Determinism

Determinism guarantee: For fixed (weights, config, policy, input), inference always produces identical (logits, witness).

Safety properties:

  • External writes blocked when coherence is low
  • KV cache frozen/flushed on instability
  • All gate decisions recorded in witness
  • No hidden state or randomness

Witness fields:

witness.decision        // ALLOW, DEFER, QUARANTINE, SKIP
witness.reason          // Why this decision was made
witness.external_writes_enabled  // Safe to persist?
witness.kv_action       // WRITE, FREEZE, FLUSH

License

Licensed under either of Apache License 2.0 or MIT license at your option.

Contributing

Contributions welcome! Areas of interest:

  • GPU kernel implementations (CUDA, Metal)
  • Additional quantization formats (GPTQ, AWQ)
  • Multi-head grouped query attention (GQA)
  • GGUF/Safetensors model format loaders
  • Batched inference optimization
  • Async/streaming token output
Commit count: 729

cargo fmt