ruvector-data-framework

Crates.ioruvector-data-framework
lib.rsruvector-data-framework
version0.3.0
created_at2026-01-05 18:51:20.131655+00
updated_at2026-01-05 19:50:27.118526+00
descriptionCore discovery framework for RuVector dataset integrations - find hidden patterns in massive datasets using vector memory, graph structures, and dynamic min-cut algorithms
homepage
repositoryhttps://github.com/ruvnet/ruvector
max_upload_size
id2024360
size2,006,185
rUv (ruvnet)

documentation

https://docs.rs/ruvector-data-framework

README

RuVector Dataset Discovery Framework

Find hidden patterns and connections in massive datasets that traditional tools miss.

RuVector turns your dataβ€”research papers, climate records, financial filingsβ€”into a connected graph, then uses cutting-edge algorithms to spot emerging trends, cross-domain relationships, and regime shifts before they become obvious.

Why RuVector?

Most data analysis tools excel at answering questions you already know to ask. RuVector is different: it helps you discover what you don't know you're looking for.

Real-world examples:

  • πŸ”¬ Research: Spot a new field forming 6-12 months before it gets a name, by detecting when papers start citing across traditional boundaries
  • 🌍 Climate: Detect regime shifts in weather patterns that correlate with economic disruptions
  • πŸ’° Finance: Find companies whose narratives are diverging from their peersβ€”often an early warning signal

Features

Feature What It Does Why It Matters
Vector Memory Stores data as 384-1536 dim embeddings Similar concepts cluster together automatically
HNSW Index O(log n) approximate nearest neighbor search 10-50x faster than brute force for large datasets
Graph Structure Connects related items with weighted edges Reveals hidden relationships in your data
Min-Cut Analysis Measures how "connected" your network is Detects regime changes and fragmentation
Cross-Domain Detection Finds bridges between different fields Discovers unexpected correlations (e.g., climate β†’ finance)
ONNX Embeddings Neural semantic embeddings (MiniLM, BGE, etc.) Production-quality text understanding
Causality Testing Checks if changes in X predict changes in Y Moves beyond correlation to actionable insights
Statistical Rigor Reports p-values and effect sizes Know which findings are real vs. noise

What's New in v0.3.0

  • HNSW Integration: O(n log n) similarity search replaces O(nΒ²) brute force
  • Similarity Cache: 2-3x speedup for repeated similarity queries
  • Batch ONNX Embeddings: Chunked processing with progress callbacks
  • Shared Utils Module: cosine_similarity, euclidean_distance, normalize_vector
  • Auto-connect by Embeddings: CoherenceEngine creates edges from vector similarity

Performance

  • ⚑ 10-50x faster similarity search (HNSW vs brute force)
  • ⚑ 8.8x faster batch vector insertion (parallel processing)
  • ⚑ 2.9x faster similarity computation (SIMD acceleration)
  • ⚑ 2-3x faster repeated queries (similarity cache)
  • πŸ“Š Works with millions of records on standard hardware

Quick Start

Prerequisites

# Ensure you're in the ruvector workspace
cd /workspaces/ruvector

Run Your First Example

# 1. Performance benchmark - see the speed improvements
cargo run --example optimized_benchmark -p ruvector-data-framework --features parallel --release

# 2. Discovery hunter - find patterns in sample data
cargo run --example discovery_hunter -p ruvector-data-framework --features parallel --release

# 3. Cross-domain analysis - detect bridges between fields
cargo run --example cross_domain_discovery -p ruvector-data-framework --release

Domain-Specific Examples

# Climate: Detect weather regime shifts
cargo run --example regime_detector -p ruvector-data-climate

# Finance: Monitor corporate filing coherence
cargo run --example coherence_watch -p ruvector-data-edgar

What You'll See

πŸ” Discovery Results:
   Pattern: Climate ↔ Finance bridge detected
   Strength: 0.73 (strong connection)
   P-value: 0.031 (statistically significant)

   β†’ Drought indices may predict utility sector
     performance with a 3-period lag

The Discovery Thesis

RuVector's unique combination of vector memory, graph structures, and dynamic minimum cut algorithms enables discoveries that most analysis tools miss:

  • Emerging patterns before they have names: Detect topic splits and merges as cut boundaries shift over time
  • Non-obvious cross-domain bridges: Find small "connector" subgraphs where disciplines quietly start citing each other
  • Causal leverage maps: Link funders, labs, venues, and downstream citations to spot high-impact intervention points
  • Regime shifts in time series: Use coherence breaks to flag fundamental changes in system behavior

Tutorial

1. Creating the Engine

use ruvector_data_framework::optimized::{
    OptimizedDiscoveryEngine, OptimizedConfig,
};
use ruvector_data_framework::ruvector_native::{
    Domain, SemanticVector,
};

let config = OptimizedConfig {
    similarity_threshold: 0.55,   // Minimum cosine similarity
    mincut_sensitivity: 0.10,     // Coherence change threshold
    cross_domain: true,           // Enable cross-domain discovery
    use_simd: true,               // SIMD acceleration
    significance_threshold: 0.05, // P-value threshold
    causality_lookback: 12,       // Temporal lookback periods
    ..Default::default()
};

let mut engine = OptimizedDiscoveryEngine::new(config);

2. Adding Data

use std::collections::HashMap;
use chrono::Utc;

// Single vector
let vector = SemanticVector {
    id: "climate_drought_2024".to_string(),
    embedding: generate_embedding(), // 128-dim vector
    domain: Domain::Climate,
    timestamp: Utc::now(),
    metadata: HashMap::from([
        ("region".to_string(), "sahel".to_string()),
        ("severity".to_string(), "extreme".to_string()),
    ]),
};
let node_id = engine.add_vector(vector);

// Batch insertion (8.8x faster)
#[cfg(feature = "parallel")]
{
    let vectors: Vec<SemanticVector> = load_vectors();
    let node_ids = engine.add_vectors_batch(vectors);
}

3. Computing Coherence

let snapshot = engine.compute_coherence();

println!("Min-cut value: {:.3}", snapshot.mincut_value);
println!("Partition sizes: {:?}", snapshot.partition_sizes);
println!("Boundary nodes: {:?}", snapshot.boundary_nodes);

Interpretation:

Min-cut Trend Meaning
Rising Network consolidating, stronger connections
Falling Fragmentation, potential regime change
Stable Steady state, consistent structure

4. Pattern Detection

let patterns = engine.detect_patterns_with_significance();

for pattern in patterns.iter().filter(|p| p.is_significant) {
    println!("{}", pattern.pattern.description);
    println!("  P-value: {:.4}", pattern.p_value);
    println!("  Effect size: {:.3}", pattern.effect_size);
}

Pattern Types:

Type Description Example
CoherenceBreak Min-cut dropped significantly Network fragmentation crisis
Consolidation Min-cut increased Market convergence
BridgeFormation Cross-domain connections Climate-finance link
Cascade Temporal causality Climate β†’ Finance lag-3
EmergingCluster New dense subgraph Research topic emerging

5. Cross-Domain Analysis

// Check coupling strength
let stats = engine.stats();
let coupling = stats.cross_domain_edges as f64 / stats.total_edges as f64;
println!("Cross-domain coupling: {:.1}%", coupling * 100.0);

// Domain coherence scores
for domain in [Domain::Climate, Domain::Finance, Domain::Research] {
    if let Some(coh) = engine.domain_coherence(domain) {
        println!("{:?}: {:.3}", domain, coh);
    }
}

Performance Benchmarks

Operation Baseline Optimized Speedup
Vector Insertion 133ms 15ms 8.84x
SIMD Cosine 432ms 148ms 2.91x
Pattern Detection 524ms 655ms -

Datasets

1. OpenAlex (Research Intelligence)

Best for: Emerging field detection, cross-discipline bridges

  • 250M+ works, 90M+ authors
  • Native graph structure
  • Bulk download + API access
use ruvector_data_openalex::{OpenAlexConfig, FrontierRadar};

let radar = FrontierRadar::new(OpenAlexConfig::default());
let frontiers = radar.detect_emerging_topics(papers);

2. NOAA + NASA (Climate Intelligence)

Best for: Regime shift detection, anomaly prediction

  • Weather observations, satellite imagery
  • Time series β†’ graph transformation
  • Economic risk modeling
use ruvector_data_climate::{ClimateConfig, RegimeDetector};

let detector = RegimeDetector::new(config);
let shifts = detector.detect_shifts();

3. SEC EDGAR (Financial Intelligence)

Best for: Corporate risk signals, peer divergence

  • XBRL financial statements
  • 10-K/10-Q filings
  • Narrative + fundamental analysis
use ruvector_data_edgar::{EdgarConfig, CoherenceMonitor};

let monitor = CoherenceMonitor::new(config);
let alerts = monitor.analyze_filing(filing);

Directory Structure

examples/data/
β”œβ”€β”€ README.md                 # This file
β”œβ”€β”€ Cargo.toml               # Workspace manifest
β”œβ”€β”€ framework/               # Core discovery framework
β”‚   β”œβ”€β”€ src/
β”‚   β”‚   β”œβ”€β”€ lib.rs              # Framework exports
β”‚   β”‚   β”œβ”€β”€ ruvector_native.rs  # Native engine with Stoer-Wagner
β”‚   β”‚   β”œβ”€β”€ optimized.rs        # SIMD + parallel optimizations
β”‚   β”‚   β”œβ”€β”€ coherence.rs        # Coherence signal computation
β”‚   β”‚   β”œβ”€β”€ discovery.rs        # Pattern detection
β”‚   β”‚   └── ingester.rs         # Data ingestion
β”‚   └── examples/
β”‚       β”œβ”€β”€ cross_domain_discovery.rs  # Cross-domain patterns
β”‚       β”œβ”€β”€ optimized_benchmark.rs     # Performance comparison
β”‚       └── discovery_hunter.rs        # Novel pattern search
β”œβ”€β”€ openalex/               # OpenAlex integration
β”œβ”€β”€ climate/                # NOAA/NASA integration
└── edgar/                  # SEC EDGAR integration

Configuration Reference

OptimizedConfig

Parameter Default Description
similarity_threshold 0.65 Minimum cosine similarity for edges
mincut_sensitivity 0.12 Sensitivity to coherence changes
cross_domain true Enable cross-domain discovery
batch_size 256 Parallel batch size
use_simd true Enable SIMD acceleration
similarity_cache_size 10000 Max cached similarity pairs
significance_threshold 0.05 P-value threshold
causality_lookback 10 Temporal lookback periods
causality_min_correlation 0.6 Minimum correlation for causality

CoherenceConfig (v0.3.0)

Parameter Default Description
similarity_threshold 0.5 Min similarity for auto-connecting embeddings
use_embeddings true Auto-create edges from embedding similarity
hnsw_k_neighbors 50 Neighbors to search per vector (HNSW)
hnsw_min_records 100 Min records to trigger HNSW (else brute force)
min_edge_weight 0.01 Minimum edge weight threshold
approximate true Use approximate min-cut for speed
parallel true Enable parallel computation

Discovery Examples

Climate-Finance Bridge

Detected: Climate ↔ Finance bridge
  Strength: 0.73
  Connections: 197

Hypothesis: Drought indices may predict
  utility sector performance with lag-2

Regime Shift Detection

Min-cut trajectory:
  t=0: 72.5 (baseline)
  t=1: 73.3 (+1.1%)
  t=2: 74.5 (+1.6%) ← Consolidation

Effect size: 2.99 (large)
P-value: 0.042 (significant)

Causality Pattern

Climate β†’ Finance causality detected
  F-statistic: 4.23
  Optimal lag: 3 periods
  Correlation: 0.67
  P-value: 0.031

Algorithms

HNSW (Hierarchical Navigable Small World)

Approximate nearest neighbor search in high-dimensional spaces.

  • Complexity: O(log n) search, O(log n) insert
  • Use: Fast similarity search for edge creation
  • Parameters: m=16, ef_construction=200, ef_search=50

Stoer-Wagner Min-Cut

Computes minimum cut of weighted undirected graph.

  • Complexity: O(VE + VΒ² log V)
  • Use: Network coherence measurement

SIMD Cosine Similarity

Processes 8 floats per iteration using AVX2.

  • Speedup: 2.9x vs scalar
  • Fallback: Chunked scalar (8 floats per iteration)

Granger Causality

Tests if past values of X predict Y.

  1. Compute cross-correlation at lags 1..k
  2. Find optimal lag with max |correlation|
  3. Calculate F-statistic
  4. Convert to p-value

Best Practices

  1. Start with low thresholds - Use similarity_threshold: 0.45 for exploration
  2. Use batch insertion - add_vectors_batch() is 8x faster
  3. Monitor coherence trends - Min-cut trajectory predicts regime changes
  4. Filter by significance - Focus on p_value < 0.05
  5. Validate causality - Temporal patterns need domain expertise

Troubleshooting

Problem Solution
No patterns detected Lower mincut_sensitivity to 0.05
Too many edges Raise similarity_threshold to 0.70
Slow performance Use --features parallel --release
Memory issues Reduce batch_size

References

License

MIT OR Apache-2.0

Commit count: 729

cargo fmt