| Crates.io | ruvector-mincut-gated-transformer |
| lib.rs | ruvector-mincut-gated-transformer |
| version | 0.1.0 |
| created_at | 2025-12-29 19:13:00.284285+00 |
| updated_at | 2025-12-29 19:13:00.284285+00 |
| description | Ultra low latency transformer inference with mincut-gated coherence control |
| homepage | |
| repository | https://github.com/ruvnet/ruvector |
| max_upload_size | |
| id | 2011127 |
| size | 772,633 |
Ultra-low latency transformer inference with graph-theoretic coherence control, designed for real-time AI systems and edge deployment
The Mincut-Gated Transformer is a production-grade inference engine that combines minimum cut (mincut) graph partitioning with adaptive compute allocation to achieve deterministic, ultra-low latency inference. Unlike traditional transformers that execute all layers uniformly, this architecture uses graph-theoretic coherence signals to dynamically skip computation, exit early, and control state updates—all while maintaining explainability and safety guarantees.
Why Mincut? The minimum cut value (λ) of an attention graph provides a principled measure of information flow coherence. When λ is high and stable, the model can safely reduce computation. When λ drops or becomes unstable, the system conservatively executes more layers. This creates a natural feedback loop between model confidence and compute allocation.
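A minimal sketch of that feedback signal, assuming λ is reported as an unsigned integer; the helper below is illustrative only and not part of the crate's API.

/// Illustrative only: decide whether the coherence signal is healthy and stable.
/// `lambda_min` and `max_drop_q15` mirror the GatePolicy thresholds shown later
/// (Q15 fixed point: 32768 == 100%).
fn coherence_is_stable(lambda: u32, lambda_prev: u32, lambda_min: u32, max_drop_q15: u32) -> bool {
    // Relative drop of λ since the previous step, in Q15.
    let drop_q15 = if lambda_prev == 0 {
        32768
    } else {
        (lambda_prev.saturating_sub(lambda) as u64 * 32768 / lambda_prev as u64) as u32
    };
    lambda >= lambda_min && drop_q15 <= max_drop_q15
}

Whether a stable signal translates into token skipping, early exit, or a conservative fallback is decided by the gate controller and policy described below.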
| Innovation | Technique | Benefit |
|---|---|---|
| λ-based Mixture-of-Depths | Route tokens using mincut delta instead of learned routers | 50% FLOPs reduction |
| Coherence-driven Early Exit | Exit when λ stabilizes across layers | 30-50% latency reduction |
| Mincut Sparse Attention | Use partition boundaries for sparse masks | 90% attention FLOPs reduction |
| Energy-based Gating | Treat coherence as energy function | Principled compute-quality tradeoffs |
| Spike-driven Scheduling | Event-driven inference on activity | 87× energy efficiency |
| Spectral Position Encoding | Graph Laplacian eigenvectors via Lanczos | O(n) structural awareness |
| EAGLE-3 Speculative Decoding | λ-guided draft tree verification | 3-5× decoding speedup |
| Mamba SSM Hybrid | Selective state spaces with O(n) complexity | Linear-time sequence modeling |
| FlashAttention Tiling | Block-wise attention with online softmax | O(n) memory, 2-4× faster |
| KV Cache INT4 | Hadamard transform + 2/4-bit quantization | 8-16× cache compression |
| RoPE with NTK/YaRN | Context extension beyond training length | 4-32× context scaling |
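To make the first row concrete, here is a hedged sketch of Mixture-of-Depths style routing; the function and its inputs are hypothetical, not the crate's API. Tokens with low coherence-derived scores bypass a block through the residual path, which is where the FLOPs reduction comes from.

/// Illustrative MoD router: mark the top `capacity` fraction of tokens for full
/// computation; the rest skip the block via the residual connection.
/// `scores` would come from mincut signals (e.g. per-token Δλ or boundary membership).
fn route_tokens(scores: &[f32], capacity: f32) -> Vec<bool> {
    let budget = ((scores.len() as f32) * capacity).ceil() as usize;
    let mut order: Vec<usize> = (0..scores.len()).collect();
    // Highest-scoring tokens first.
    order.sort_by(|&a, &b| scores[b].total_cmp(&scores[a]));
    let mut process = vec![false; scores.len()];
    for &i in order.iter().take(budget) {
        process[i] = true;
    }
    process
}

With `capacity = 0.5`, half of the tokens skip each routed block, which is the source of the 50% FLOPs figure above.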
use ruvector_mincut_gated_transformer::prelude::*;
// Create configuration
let config = TransformerConfig::micro();
let policy = GatePolicy::default();
// Load weights (or use empty for testing)
let weights = QuantizedWeights::empty(&config);
// Capture the logits length before `config` is moved into the transformer
let logits_len = config.logits as usize;
// Create transformer
let mut transformer = MincutGatedTransformer::new(config, policy, weights)?;
// Create gate packet from mincut signals
let gate = GatePacket {
    lambda: 100,                      // Minimum cut value
    lambda_prev: 95,                  // Previous lambda for delta computation
    boundary_edges: 5,                // Cross-partition edge count
    boundary_concentration_q15: 8192, // ~25% concentration (Q15 format)
    partition_count: 3,               // Number of detected partitions
    flags: 0,
};
// Prepare input
let input = InferInput::from_tokens(&[1, 2, 3, 4], gate);
// Allocate output buffer
let mut logits = vec![0i32; logits_len];
let mut output = InferOutput::new(&mut logits);
// Run inference
transformer.infer(&input, &mut output)?;
// Check witness for gate decisions
println!("Decision: {:?}", output.witness.decision);
println!("Reason: {:?}", output.witness.reason);
println!("External writes allowed: {}", output.witness.external_writes_enabled);
          ┌─────────────────┐
          │   Gate Packet   │
          │  (λ, Δλ, edges) │
          └────────┬────────┘
                   │
Input ────────────►│
                   ▼
          ┌─────────────────┐
          │ Spike Scheduler │──── Skip (tier 3)
          │  Event-driven   │
          └────────┬────────┘
                   │
                   ▼
          ┌─────────────────┐
          │ Gate Controller │──── Select tier 0/1/2
          │ Coherence-gated │
          └────────┬────────┘
                   │
                   ▼
    ┌──────────────┴──────────────┐
    │       Transformer Core      │
    │  ┌────────────────────────┐ │
    │  │  MoD Router (λ-based)  │ │
    │  └───────────┬────────────┘ │
    │              ▼              │
    │  ┌────────────────────────┐ │
    │  │    Sparse Attention    │ │
    │  │  (mincut boundaries)   │ │
    │  └───────────┬────────────┘ │
    │              ▼              │
    │  ┌────────────────────────┐ │
    │  │    Early Exit Check    │ │──── Exit if λ stable
    │  │ (coherence threshold)  │ │
    │  └───────────┬────────────┘ │
    └──────────────┬──────────────┘
                   │
                   ▼
          ┌─────────────────┐
          │ Output + Witness│
          │  (explainable)  │
          └─────────────────┘
| Tier | Layers | Seq Len | Window | Use Case | Speedup |
|---|---|---|---|---|---|
| 0 | 4 | 64 | 16 | Normal (high λ) | 1× |
| 1 | 2 | 32 | 8 | Reduced (moderate λ) | 2-3× |
| 2 | 1 | 8 | 4 | Safe mode (low λ) | 5-10× |
| 3 | 0 | 0 | 0 | Skip (no spike) | 50-200× |
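A simplified sketch of how the gate and spike signals from the diagram might map onto these tiers; the thresholds and the `spike` flag are assumptions for illustration, not the crate's internal logic.

/// Illustrative tier selection from coherence and activity signals.
fn select_tier(lambda: u32, lambda_prev: u32, spike: bool, lambda_min: u32) -> u8 {
    if !spike {
        return 3; // No input activity: skip inference entirely.
    }
    if lambda < lambda_min {
        return 2; // Low coherence: safe mode with minimal compute and frozen writes.
    }
    // A swing of more than half of the previous λ counts as unstable here.
    let unstable = lambda.abs_diff(lambda_prev).saturating_mul(2) > lambda_prev.max(1);
    if unstable { 1 } else { 0 }
}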
| Workload Type | Skip Rate | Speedup | Memory Reduction |
|---|---|---|---|
| Streaming (low activity) | 70% | 10-15× | 80% |
| Interactive (bursty) | 40% | 4-6× | 50% |
| Continuous (high throughput) | 10% | 2-3× | 40% |
| Safety-critical (conservative) | 5% | 1.5-2× | 25% |
| Operation | Scalar | SIMD | Speedup |
|---|---|---|---|
| INT8 GEMM (256×256) | 12ms | 1.8ms | 6.7× |
| GELU activation (1024) | 45µs | 8µs | 5.6× |
| Quantize f32→i8 (1024) | 38µs | 7µs | 5.4× |
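For reference, the scalar f32→i8 row corresponds to a symmetric per-tensor quantization along the lines of the sketch below; the SIMD numbers come from vectorizing this loop. This is a simplified reference version, not the crate's kernel.

/// Simplified symmetric quantization: map f32 values onto i8 with one shared scale.
/// Returns the scale needed to dequantize (x ≈ q as f32 * scale).
fn quantize_f32_to_i8(input: &[f32], output: &mut [i8]) -> f32 {
    let max_abs = input.iter().fold(0.0f32, |m, &x| m.max(x.abs()));
    let scale = if max_abs > 0.0 { max_abs / 127.0 } else { 1.0 };
    for (q, &x) in output.iter_mut().zip(input) {
        *q = (x / scale).round().clamp(-127.0, 127.0) as i8;
    }
    scale
}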
| Model Config | INT8 | INT4 | Arena Overhead |
|---|---|---|---|
| Micro (2L, 128H) | 1.2 MB | 0.6 MB | +64 bytes |
| Baseline (4L, 256H) | 8.5 MB | 4.3 MB | +64 bytes |
| Medium (12L, 768H) | ~85 MB | ~43 MB | +64 bytes |
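The weight footprints above can be sanity-checked from the configuration. The sketch below assumes a conventional layout (four attention projections plus a 4× FFN per layer, plus token embeddings) and a hypothetical vocabulary size; treat it as a rough estimate, not the crate's exact accounting.

/// Rough INT8 weight footprint in bytes; the 4× FFN width and `vocab_size`
/// are illustrative assumptions.
fn approx_int8_weight_bytes(layers: usize, hidden: usize, vocab_size: usize) -> usize {
    let attention = 4 * hidden * hidden;  // Q, K, V and output projections
    let ffn = 2 * hidden * (4 * hidden);  // up and down projections
    layers * (attention + ffn) + vocab_size * hidden // plus token embeddings
}

INT4 roughly halves this figure, and the arena adds only the fixed alignment overhead shown in the table.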
// Micro: WASM, edge gateways, embedded
let config = TransformerConfig::micro();
// Seq: 32, Hidden: 128, Heads: 4, Layers: 2
// Baseline: CPU inference, development
let config = TransformerConfig::baseline();
// Seq: 64, Hidden: 256, Heads: 4, Layers: 4
let policy = GatePolicy {
    lambda_min: 30,                            // Minimum coherence threshold
    drop_ratio_q15_max: 16384,                 // Max λ drop (50% in Q15)
    boundary_edges_max: 20,                    // Max cross-partition edges
    boundary_concentration_q15_max: 24576,     // Max concentration (75%)
    partitions_max: 8,                         // Max partition count
    spike_rate_q15_max: 26214,                 // Max spike rate (80%)
    allow_kv_write_when_unstable: false,       // Freeze KV cache
    allow_external_write_when_unstable: false, // Block external writes
};
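The `*_q15` fields are Q15 fixed-point fractions where 32768 corresponds to 100%. A small helper, illustrative and not part of the crate, makes the thresholds above easier to read:

/// Convert a fraction in [0.0, 1.0] to Q15 fixed point (32768 == 100%).
fn to_q15(fraction: f32) -> u32 {
    (fraction * 32768.0).round() as u32
}
// to_q15(0.50) == 16384   -> drop_ratio_q15_max
// to_q15(0.75) == 24576   -> boundary_concentration_q15_max
// to_q15(0.80) == 26214   -> spike_rate_q15_max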
- sliding_window (default) — Sliding window attention
- linear_attention — Linear attention for O(n) scaling
- simd — AVX2/NEON SIMD acceleration
- int4 — INT4 quantization support
- fixed_point_softmax — Fixed-point for embedded targets
- rmsnorm — RMSNorm instead of LayerNorm
- spectral_pe — Spectral position encoding with Lanczos
- sparse_attention — Mincut-guided sparse attention
- energy_gate — Energy-based gate decisions
- spike_attention — Spike-driven attention mechanism
- trace — Runtime tracing and snapshots
- wasm — WebAssembly support
- no_std_gateway — No-std for embedded gateways

| Feature | Status | Notes |
|---|---|---|
| GPU inference | Not implemented | CUDA/Metal kernels needed |
| KV cache persistence | ✅ Implemented | INT4 with Hadamard transforms |
| Multi-head grouped query | Not implemented | GQA for memory efficiency |
| Flash Attention | ✅ Implemented | CPU tiled with online softmax |
| Rotary position embeddings | ✅ Implemented | RoPE with NTK/YaRN scaling |
| Criterion benchmarks | ✅ Implemented | Kernel, gate, latency benchmarks |
| GGML/GGUF format | Not implemented | Model format compatibility |
| Batched inference | Partial | Single-sequence optimized |
| Async/streaming output | Not implemented | Token-by-token streaming |
| Mamba/SSM hybrid | ✅ Implemented | Selective state space layer |
| Speculative decoding | ✅ Implemented | EAGLE-3 style with λ-guidance |
This implementation integrates peer-reviewed research; see docs/THEORY.md for detailed theoretical foundations.
use ruvector_mincut_gated_transformer::prelude::*;
use ruvector_mincut::MincutEngine;
// Compute mincut from attention graph
let mut mincut = MincutEngine::new(num_nodes);
// ... add edges from attention weights ...
let lambda = mincut.compute_mincut();
// Create gate packet
let gate = GatePacket {
    lambda,
    lambda_prev: prev_lambda,
    boundary_edges: mincut.boundary_edge_count(),
    ..Default::default()
};
// Run gated inference
transformer.infer(&InferInput::from_tokens(tokens, gate), &mut output)?;
use ruvector_mincut_gated_transformer::arena::{WeightArena, calculate_arena_size};
// Calculate total size for model
let size = calculate_arena_size(layers, hidden, ffn_mult, heads);
let mut arena = WeightArena::new(size);
// Allocate weight slices
let w_q = arena.alloc_i8(hidden * hidden).unwrap();
let scales = arena.alloc_f32(hidden).unwrap();
use ruvector_mincut_gated_transformer::kernel::quant4::{Int4Weights, int4_gemv};
// Create INT4 weights from f32 (half the memory of INT8 storage)
let int4_w = Int4Weights::from_f32(&weights, rows, cols);
// Matrix-vector multiplication
int4_gemv(&int4_w, &input, 1.0, &mut output);
use ruvector_mincut_gated_transformer::kv_cache::{QuantizedKVCache, QuantBits};
// Create 2-bit quantized KV cache (16× compression)
let mut cache = QuantizedKVCache::new(
    num_layers,
    num_heads,
    head_dim,
    max_seq_len,
    QuantBits::Two,
);
// Store key/value with automatic Hadamard transform
cache.store_key(layer, head, position, &key_vector);
cache.store_value(layer, head, position, &value_vector);
// Retrieve (dequantize + inverse Hadamard)
let key = cache.get_key(layer, head, position);
use ruvector_mincut_gated_transformer::rope::{RopeConfig, RopeEmbedding, RopeScaling};
// Standard RoPE
let config = RopeConfig::default();
let rope = RopeEmbedding::new(&config)?;
// NTK-aware scaling for 4× context extension
let config = RopeConfig {
    scaling_type: RopeScaling::NTKAware { alpha: 4.0 },
    ..Default::default()
};
let rope = RopeEmbedding::new(&config)?;
// Apply to Q/K vectors
rope.apply(&mut q, &mut k, position);
use ruvector_mincut_gated_transformer::flash_attention::{
    FlashAttentionConfig, flash_attention_forward,
};
let config = FlashAttentionConfig {
    block_size_q: 64,
    block_size_kv: 64,
    head_dim: 64,
    causal: true,
    softmax_scale: 0.125,
};
// O(n) memory attention
flash_attention_forward(&config, &q, &k, &v, seq_len, seq_len, &mut output);
use ruvector_mincut_gated_transformer::mamba::{MambaConfig, MambaLayer};
let config = MambaConfig::default();
let mut layer = MambaLayer::new(config);
// Recurrent mode (O(1) memory per step)
for token in tokens.iter() {
    let output = layer.step_recurrent(token);
}
// Batch mode for training
let outputs = layer.forward_sequence(&input_sequence);
use ruvector_mincut_gated_transformer::speculative::{
    SpeculativeConfig, SpeculativeDecoder,
};
let config = SpeculativeConfig {
    max_draft_tokens: 8,
    tree_width: 4,
    acceptance_threshold: 0.9,
    lambda_guidance: true, // Use mincut λ for tree construction
};
let mut decoder = SpeculativeDecoder::new(config, &gate_policy);
// Generate with speculation (3-5× faster)
let (tokens, stats) = decoder.generate_with_speculation(
    &draft_model,
    &target_model,
    &prompt,
    max_new_tokens,
);
Determinism guarantee: For fixed (weights, config, policy, input), inference always produces identical (logits, witness).
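A sketch of how the guarantee can be exercised in a test, reusing the quickstart API; the `unwrap()` calls, the comparison via Debug formatting, and building a fresh transformer per run are choices made for the example, not crate requirements.

use ruvector_mincut_gated_transformer::prelude::*;

/// Build an identical transformer, run an identical input, and return comparable results.
fn infer_once() -> (Vec<i32>, String) {
    let config = TransformerConfig::micro();
    let logits_len = config.logits as usize;
    let weights = QuantizedWeights::empty(&config);
    let mut transformer =
        MincutGatedTransformer::new(config, GatePolicy::default(), weights).unwrap();
    let gate = GatePacket { lambda: 100, lambda_prev: 95, ..Default::default() };
    let mut logits = vec![0i32; logits_len];
    let mut output = InferOutput::new(&mut logits);
    transformer
        .infer(&InferInput::from_tokens(&[1, 2, 3, 4], gate), &mut output)
        .unwrap();
    // Compare the witness decision via its Debug form to avoid assuming PartialEq.
    let decision = format!("{:?}", output.witness.decision);
    drop(output);
    (logits, decision)
}

// Identical (weights, config, policy, input) must yield identical logits and decisions.
assert_eq!(infer_once(), infer_once());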
Safety properties are surfaced through the witness fields:
witness.decision // ALLOW, DEFER, QUARANTINE, SKIP
witness.reason // Why this decision was made
witness.external_writes_enabled // Safe to persist?
witness.kv_action // WRITE, FREEZE, FLUSH
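Continuing from the quickstart's `output`, a hedged example of acting on the witness before committing side effects; `commit_downstream` is a placeholder for application-side code, not a crate function.

/// Placeholder for application-side persistence; not a crate API.
fn commit_downstream(note: &str) {
    println!("committing results ({note})");
}

// Only commit side effects when the witness explicitly allows external writes.
if output.witness.external_writes_enabled {
    commit_downstream("witness allowed external writes");
} else {
    eprintln!(
        "withholding output: decision={:?}, reason={:?}",
        output.witness.decision, output.witness.reason
    );
}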
Licensed under either of Apache License 2.0 or MIT license at your option.
Contributions welcome! Areas of interest: