| Crates.io | tensorlogic-trustformers |
| lib.rs | tensorlogic-trustformers |
| version | 0.1.0-alpha.2 |
| created_at | 2025-11-07 22:43:59.088667+00 |
| updated_at | 2026-01-03 21:05:39.538506+00 |
| description | Transformer-as-rules: Self-attention and FFN layers as einsum expressions |
| homepage | https://github.com/cool-japan/tensorlogic |
| repository | https://github.com/cool-japan/tensorlogic |
| max_upload_size | |
| id | 1922303 |
| size | 677,041 |
Transformer architectures as TensorLogic einsum graphs
This crate provides implementations of transformer components (self-attention, multi-head attention, feed-forward networks) as einsum operations that compile to TensorLogic IR and execute on any TensorLogic backend.
use tensorlogic_trustformers::{
AttentionConfig, SelfAttention, MultiHeadAttention,
FeedForwardConfig, FeedForward,
};
use tensorlogic_ir::EinsumGraph;
// Configure and build self-attention
let attn_config = AttentionConfig::new(512, 8).unwrap();
let self_attn = SelfAttention::new(attn_config).unwrap();
let mut graph = EinsumGraph::new();
graph.add_tensor("Q");
graph.add_tensor("K");
graph.add_tensor("V");
let outputs = self_attn.build_attention_graph(&mut graph).unwrap();
// Configure multi-head attention
let mha_config = AttentionConfig::new(512, 8).unwrap();
let mha = MultiHeadAttention::new(mha_config).unwrap();
// Configure feed-forward network
let ffn_config = FeedForwardConfig::new(512, 2048)
.with_activation("gelu")
.with_dropout(0.1);
let ffn = FeedForward::new(ffn_config).unwrap();
Attention(Q, K, V) = softmax(QK^T / √d_k) V
Einsum breakdown:
einsum("bqd,bkd->bqk", Q, K)scores / sqrt(d_k)softmax(scores, axis=-1)einsum("bqk,bkv->bqv", attn, V)Where:
- b = batch dimension
- q = query sequence length
- k = key sequence length
- d = model dimension
- v = value dimension

Multi-head attention splits the model dimension into parallel attention heads (a plain-Rust sketch of these steps follows the list below):
1. Reshape: [B, S, D] -> [B, H, S, D_k] where D_k = D/H
2. Attention per head: einsum("bhqd,bhkd->bhqk", Q, K)
3. Scale and softmax
4. Apply to values: einsum("bhqk,bhkv->bhqv", attn, V)
5. Concatenate heads: [B, H, S, D_k] -> [B, S, D]
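Below is a minimal, dependency-free Rust sketch of these steps for a single batch element, written with nested Vecs and explicit loops so the einsum strings can be read off directly. It is purely illustrative: the crate emits these operations as nodes in an EinsumGraph rather than executing loops, and the head split/concat is represented here simply by the outer head dimension.

// Illustrative only: one batch element, [n_heads][seq][d_k] layout.
fn softmax_in_place(row: &mut [f64]) {
    let max = row.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let sum: f64 = row
        .iter_mut()
        .map(|x| {
            *x = (*x - max).exp();
            *x
        })
        .sum();
    for x in row.iter_mut() {
        *x /= sum;
    }
}

/// q, k, v: [n_heads][seq][d_k]; returns [n_heads][seq][d_k].
fn multi_head_attention(
    q: &[Vec<Vec<f64>>],
    k: &[Vec<Vec<f64>>],
    v: &[Vec<Vec<f64>>],
) -> Vec<Vec<Vec<f64>>> {
    let d_k = q[0][0].len() as f64;
    q.iter()
        .zip(k)
        .zip(v)
        .map(|((qh, kh), vh)| {
            let (seq_q, seq_k) = (qh.len(), kh.len());
            // scores = einsum("qd,kd->qk", Q, K) / sqrt(d_k), then softmax over keys
            let mut attn = vec![vec![0.0; seq_k]; seq_q];
            for i in 0..seq_q {
                for j in 0..seq_k {
                    let dot: f64 = qh[i].iter().zip(&kh[j]).map(|(a, b)| a * b).sum();
                    attn[i][j] = dot / d_k.sqrt();
                }
                softmax_in_place(&mut attn[i]);
            }
            // output = einsum("qk,kd->qd", attn, V)
            (0..seq_q)
                .map(|i| {
                    (0..qh[i].len())
                        .map(|d| (0..seq_k).map(|j| attn[i][j] * vh[j][d]).sum())
                        .collect()
                })
                .collect()
        })
        .collect()
}

fn main() {
    // 2 heads, 3 positions, d_k = 2, i.e. d_model = 4 split across heads;
    // the reshape/concat steps correspond to this outer head dimension.
    let x = vec![vec![vec![0.1, 0.2], vec![0.3, 0.1], vec![0.0, 0.4]]; 2];
    let out = multi_head_attention(&x, &x, &x);
    assert_eq!((out.len(), out[0].len(), out[0][0].len()), (2, 3, 2));
    println!("{:?}", out[0][0]);
}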
Position-wise feed-forward network with two linear transformations:
FFN(x) = activation(xW1 + b1)W2 + b2
Einsum notation:
einsum("bsd,df->bsf", x, W1)activation(h1) (GELU, ReLU, etc.)einsum("bsf,fd->bsd", h2, W2)Where:
d = d_modelf = d_ff (typically 4 * d_model)use tensorlogic_trustformers::AttentionConfig;
let config = AttentionConfig::new(512, 8)?
.with_causal(true) // Enable causal masking
.with_dropout(0.1); // Set dropout probability
assert_eq!(config.d_model, 512);
assert_eq!(config.n_heads, 8);
assert_eq!(config.d_k, 64); // Automatically computed
use tensorlogic_trustformers::FeedForwardConfig;
let config = FeedForwardConfig::new(512, 2048)
.with_activation("gelu") // or "relu", "silu", etc.
.with_dropout(0.1);
assert_eq!(config.d_model, 512);
assert_eq!(config.d_ff, 2048);
use tensorlogic_trustformers::TransformerLayerConfig;
let config = TransformerLayerConfig::new(512, 8, 2048)?
.with_pre_norm(true); // Use pre-layer normalization
assert!(config.validate().is_ok());
use tensorlogic_trustformers::SelfAttention;
use tensorlogic_ir::EinsumGraph;
let attn = SelfAttention::new(config)?;
let mut graph = EinsumGraph::new();
// Add input tensors (Q, K, V)
graph.add_tensor("Q"); // [batch, seq, d_model]
graph.add_tensor("K"); // [batch, seq, d_model]
graph.add_tensor("V"); // [batch, seq, d_model]
// Build attention graph
let outputs = attn.build_attention_graph(&mut graph)?;
// outputs[0] = attention output [batch, seq, d_model]
use tensorlogic_trustformers::MultiHeadAttention;
let mha = MultiHeadAttention::new(config)?;
let mut graph = EinsumGraph::new();
graph.add_tensor("Q");
graph.add_tensor("K");
graph.add_tensor("V");
let outputs = mha.build_mha_graph(&mut graph)?;
use tensorlogic_trustformers::FeedForward;
let ffn = FeedForward::new(config)?;
let mut graph = EinsumGraph::new();
// Add input tensors
graph.add_tensor("x"); // [batch, seq, d_model]
graph.add_tensor("W1"); // [d_model, d_ff]
graph.add_tensor("b1"); // [d_ff]
graph.add_tensor("W2"); // [d_ff, d_model]
graph.add_tensor("b2"); // [d_model]
let outputs = ffn.build_ffn_graph(&mut graph)?;
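To put the shapes above in perspective: with the d_model = 512, d_ff = 2048 configuration used throughout this README, W1 is [512, 2048] and W2 is [2048, 512], so the two projections contribute 2 × 512 × 2048 = 2,097,152 weights per layer, plus 2,048 + 512 = 2,560 bias parameters from b1 and b2.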
GLU-style networks use element-wise gating for improved capacity:
use tensorlogic_trustformers::GatedFeedForward;
let glu = GatedFeedForward::new(config)?;
let mut graph = EinsumGraph::new();
graph.add_tensor("x");
graph.add_tensor("W_gate");
graph.add_tensor("W_value");
graph.add_tensor("W_out");
let outputs = glu.build_glu_graph(&mut graph)?;
Formula: GLU(x) = σ(xW_gate) ⊙ activation(xW_value) W_out
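As a minimal sketch of the gating itself for a single token vector (illustrative only; the weight shapes are assumed by analogy with the FFN weights above, and relu stands in for whatever activation the configuration selects):

// W_gate, W_value: [d_model][d_ff], W_out: [d_ff][d_model] (assumed shapes).
fn matvec(x: &[f64], w: &[Vec<f64>]) -> Vec<f64> {
    // Returns x · w, where w has one row per element of x.
    (0..w[0].len())
        .map(|c| x.iter().zip(w).map(|(xi, row)| xi * row[c]).sum())
        .collect()
}

fn gated_ffn(x: &[f64], w_gate: &[Vec<f64>], w_value: &[Vec<f64>], w_out: &[Vec<f64>]) -> Vec<f64> {
    let gate = matvec(x, w_gate);
    let value = matvec(x, w_value);
    // Element-wise gating: sigma(gate) ⊙ activation(value).
    let gated: Vec<f64> = gate
        .iter()
        .zip(&value)
        .map(|(g, v)| (1.0 / (1.0 + (-g).exp())) * v.max(0.0))
        .collect();
    matvec(&gated, w_out)
}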
The einsum graphs produced by this crate integrate seamlessly with the TensorLogic ecosystem:
use tensorlogic_compiler::CompilerContext;
let mut ctx = CompilerContext::new();
// Compile TLExpr rules that use transformer operations
use tensorlogic_scirs_backend::Scirs2Executor;
let executor = Scirs2Executor::new();
// Execute the transformer graph on SciRS2 backend
use tensorlogic_ir::graph::optimization::optimize_graph;
let stats = optimize_graph(&mut graph)?;
// Apply dead code elimination, CSE, etc.
This crate follows the core TensorLogic principle of computation as einsum rules: transformer components are declared as einsum graphs, compiled to TensorLogic IR, and executed on any TensorLogic backend.
See the examples directory for complete examples:
- 01_basic_encoder.rs - Basic transformer encoder usage
- 02_trustformers_integration.rs - TrustformeRS integration
- 03_rule_based_attention.rs - Rule-based attention patterns
- 04_sparse_attention.rs - Sparse attention for long sequences
- 05_gradient_checkpointing.rs - Memory-efficient training strategies
- 06_kv_cache_inference.rs - Fast autoregressive generation with KV-cache

Run the test suite:
cargo nextest run -p tensorlogic-trustformers
All 229 tests should pass with zero warnings.
Run performance benchmarks:
cargo bench --bench model_benchmarks
This will generate HTML reports in target/criterion/ with detailed performance metrics.
The einsum-based approach enables graph-level optimizations such as dead code elimination and common subexpression elimination, and lets the same transformer graph run on any TensorLogic backend.
See TODO.md for the development roadmap. Current status: 100% complete 🎉
This crate is part of the TensorLogic project and is licensed under Apache-2.0.
Three types of position encodings for sequence modeling:
use tensorlogic_trustformers::{
    LearnedPositionEncoding, PositionEncodingConfig, RelativePositionEncoding,
    SinusoidalPositionEncoding,
};
// Sinusoidal (fixed) encoding
let config = PositionEncodingConfig::sinusoidal(512, 2048);
let pe = SinusoidalPositionEncoding::new(config).unwrap();
// Learned position embeddings
let config = PositionEncodingConfig::learned(512, 2048);
let pe = LearnedPositionEncoding::new(config).unwrap();
// Relative position encoding
let config = PositionEncodingConfig::relative(512, 32, 128);
let pe = RelativePositionEncoding::new(config).unwrap();
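For reference, sinusoidal encodings are conventionally defined by the fixed formulas from the original Transformer (assumed here; the crate's exact formulation may differ in details such as the base constant):

PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))

Because the sinusoidal variant is fixed, it adds no learned parameters; the learned variant typically stores a [max_len, d_model] embedding table, which for the 512 × 2048 configuration above amounts to 2048 × 512 = 1,048,576 parameters.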
Standard LayerNorm and efficient RMSNorm:
use tensorlogic_trustformers::{LayerNormConfig, LayerNorm, RMSNorm};
// Standard layer normalization
let config = LayerNormConfig::new(512).with_eps(1e-6);
let ln = LayerNorm::new(config).unwrap();
// RMS normalization (more efficient)
let rms = RMSNorm::new(config).unwrap();
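In their standard formulations (assumed here):

LayerNorm(x) = (x - mean(x)) / sqrt(var(x) + eps) * gamma + beta
RMSNorm(x)   = x / sqrt(mean(x²) + eps) * gamma

RMSNorm drops the mean subtraction and the beta bias, which is what makes it cheaper to compute while behaving similarly in practice.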
Full encoder and decoder layers with residual connections:
use tensorlogic_trustformers::{DecoderLayer, DecoderLayerConfig, EncoderLayer, EncoderLayerConfig};
// Encoder layer with pre-normalization
let config = EncoderLayerConfig::new(512, 8, 2048)?
.with_pre_norm(true)
.with_dropout(0.1);
let encoder = EncoderLayer::new(config)?;
// Decoder layer with causal masking
let decoder_config = DecoderLayerConfig::new(512, 8, 2048)?;
let decoder = DecoderLayer::new(decoder_config)?;
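The with_pre_norm option controls where layer normalization sits relative to each residual connection. In the standard formulations (shown for the attention sub-layer; the FFN sub-layer is analogous):

Pre-norm:  y = x + Attention(LayerNorm(x))
Post-norm: y = LayerNorm(x + Attention(x))

Pre-norm tends to train more stably for deep stacks, which is why it is a common default in recent transformer variants.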
Multi-layer transformer architectures:
use tensorlogic_trustformers::{EncoderStack, EncoderStackConfig};
use tensorlogic_ir::EinsumGraph;
// 6-layer transformer encoder
let config = EncoderStackConfig::new(6, 512, 8, 2048, 1024)?
.with_dropout(0.1)
.with_final_layer_norm(true);
let encoder_stack = EncoderStack::new(config)?;
// Build complete encoder graph
let mut graph = EinsumGraph::new();
graph.add_tensor("input");
let outputs = encoder_stack.build_encoder_stack_graph(&mut graph)?;
Integrate logical rules with attention mechanisms:
use tensorlogic_trustformers::{AttentionConfig, RuleAttentionConfig, RuleBasedAttention};
use tensorlogic_trustformers::rule_attention::patterns;
// Hard constraint: only attend where rule is satisfied
let base_attn = AttentionConfig::new(512, 8)?;
let config = RuleAttentionConfig::hard(base_attn);
let rule = patterns::syntactic_dependency("head", "dep");
let attn = RuleBasedAttention::new(config)?.with_rule(rule);
// Soft constraint: bias attention towards rule-satisfying positions
let config = RuleAttentionConfig::soft(base_attn, 0.7);
// Gated: interpolate between content and rule attention
let config = RuleAttentionConfig::gated(base_attn, 0.5);
Memory-efficient training for large models:
use tensorlogic_trustformers::{CheckpointConfig, EncoderStackConfig};
// Create a large model
let config = EncoderStackConfig::new(12, 768, 12, 3072, 512)?;
// Uniform checkpointing: checkpoint every 2 layers
let checkpoint = CheckpointConfig::uniform(2);
println!("Memory savings: {:.1}%", checkpoint.memory_savings(12) * 100.0);
println!("Compute overhead: {:.2}x", checkpoint.compute_overhead(12));
// Selective checkpointing: checkpoint specific layers
let checkpoint = CheckpointConfig::selective(vec![0, 3, 6, 9]);
// Dynamic checkpointing: automatically balance memory vs. compute
let checkpoint = CheckpointConfig::dynamic(12, 0.3)?; // Target 30% memory usage
// Customize what to checkpoint
let checkpoint = CheckpointConfig::uniform(2)
.with_checkpoint_attention(true) // Checkpoint attention
.with_checkpoint_ffn(false); // Don't checkpoint FFN
Benefits: activations for non-checkpointed layers are discarded after the forward pass and recomputed during backpropagation, trading extra compute (see compute_overhead()) for a large reduction in peak activation memory (see memory_savings()).
Enable efficient autoregressive generation with dramatic speedups:
use tensorlogic_trustformers::{KVCache, KVCacheConfig};
// Create cache for 12-layer model (GPT-2 small)
let mut cache = KVCache::new(12, 12, 64);
// During autoregressive generation
for step in 0..100 {
// Compute keys/values only for new token
let keys = compute_keys_for_new_token(); // [batch, heads, 1, dim]
let values = compute_values_for_new_token(); // [batch, heads, 1, dim]
// Update cache for all layers
for layer_idx in 0..12 {
cache.update_layer(layer_idx, keys.clone(), values.clone())?;
}
// Retrieve cached keys/values for attention
let (all_keys, all_values) = cache.get_layer(0)?;
// Compute attention only over new position
// ... (attention computation using cached K,V)
cache.next_step();
}
// Monitor cache usage
let stats = cache.stats();
println!("{}", stats.summary());
// CacheStats:
// Layers: 12
// Seq len: 100
// Memory: 7.0/4608.0 MB (0.2%)
// Step: 100
// Enabled: true
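The 7.0 MB figure follows directly from the cache shape (assuming f32 storage and batch size 1): 2 tensors (keys and values) × 12 layers × 12 heads × 64 head dimensions × 100 cached positions × 4 bytes = 7,372,800 bytes ≈ 7.0 MiB. The 4608.0 MB denominator is the preallocated capacity for the configured maximum sequence length and batch size.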
Performance impact: with the cache enabled, each generation step computes keys and values only for the newly generated token and reuses cached entries for all previous positions, avoiding recomputation of the full prefix at every step.
Configuration Options:
// Custom cache configuration
let config = KVCacheConfig::new(24, 16, 64) // GPT-2 medium (24 layers, 16 heads, 64-dim heads)
.with_max_seq_len(4096) // Support longer contexts
.with_max_batch_size(64) // Larger batch inference
.with_enabled(true); // Enable/disable dynamically
let cache = KVCache::from_config(config)?;
// Memory estimation
println!("Max memory: {:.1} MB", config.memory_usage_mb());
Efficient attention for long sequences:
use tensorlogic_trustformers::{AttentionConfig, LocalAttention, SparseAttention, SparseAttentionConfig};
// Strided sparse attention (attend every k-th position)
let base_attn = AttentionConfig::new(512, 8)?;
let config = SparseAttentionConfig::strided(base_attn, 4)?;
let sparse = SparseAttention::new(config)?;
// Local windowed attention
let config = SparseAttentionConfig::local(base_attn, 128)?;
let sparse = SparseAttention::new(config)?;
// Or use dedicated LocalAttention for efficiency
let local = LocalAttention::new(base_attn, 64)?;
println!("Memory savings: {:.1}%", local.memory_savings(1024) * 100.0);
Helper functions for model analysis:
use tensorlogic_trustformers::utils::{encoder_stack_stats, presets};
// Get model statistics
let config = presets::gpt2_small();
let stats = encoder_stack_stats(&config);
println!("{}", stats.summary());
// Output: ModelStats:
// Total params: 117.00M
// Trainable: 117.00M
// Layers: 12
// d_model: 768
// Memory: 468 MB
// Use preset configurations
let gpt2 = presets::gpt2_small();
let bert = presets::bert_base();
let (encoder, decoder) = presets::transformer_base();
Status: 🎉 Production Ready (v0.1.0-alpha.2)
Last Updated: 2025-12-16
Tests: 229/229 passing (100%)
Examples: 6 comprehensive examples
Benchmarks: Criterion suite with HTML reports
Features: Complete transformer implementation with optimizations
Part of: TensorLogic Ecosystem