| Crates.io | term-guard |
| lib.rs | term-guard |
| version | 0.0.2 |
| created_at | 2025-07-30 04:28:01.543157+00 |
| updated_at | 2025-10-02 17:22:14.924203+00 |
| description | A Rust data validation library providing Deequ-like capabilities without Spark dependencies |
| homepage | https://github.com/withterm/term |
| repository | https://github.com/withterm/term |
| max_upload_size | |
| id | 1773105 |
| size | 1,736,641 |
Bulletproof data validation without the infrastructure headache.
Get Started • Documentation • Examples • API Reference
Every data pipeline is a ticking time bomb. Null values crash production. Duplicate IDs corrupt databases. Format changes break downstream systems. Yet most teams discover these issues only after the damage is done.
Traditional data validation tools assume you have a data team, a Spark cluster, and weeks to implement. Term takes a different approach:
Term is data validation for the 99% of engineering teams who just want their data to work.
```bash
# Add to your Cargo.toml
cargo add term-guard tokio --features tokio/full
```
```rust
use term_guard::prelude::*;

#[tokio::main]
async fn main() -> Result<()> {
    // Load your data
    let ctx = SessionContext::new();
    ctx.register_csv("users", "users.csv", CsvReadOptions::new()).await?;

    // Define what good data looks like
    let checks = ValidationSuite::builder("User Data Quality")
        .check(
            Check::builder("No broken data")
                .is_complete("user_id")          // No missing IDs
                .is_unique("email")              // No duplicate emails
                .has_pattern("email", r"@", 1.0) // All emails have @
                .build()
        )
        .build();

    // Validate and get instant feedback
    let report = checks.run(&ctx).await?;
    println!("{}", report); // ✅ All 3 checks passed!

    Ok(())
}
```
That's it! No clusters to manage, no JVMs to tune, no YAML to write.
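In CI you usually want a failing check to fail the build. Here is a minimal sketch of that pattern; the `passed()` accessor is an assumption for illustration, not a confirmed term-guard API, so check the actual report type for the real method:

```rust
// Hypothetical: gate the pipeline on the validation outcome.
// `passed()` is an assumed accessor, not a confirmed term-guard API.
let report = checks.run(&ctx).await?;
if !report.passed() {
    eprintln!("{report}");
    std::process::exit(1);
}
```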
```rust
// Validate a production dataset with multiple quality checks
let suite = ValidationSuite::builder("Production Pipeline")
    .check(
        Check::builder("Data Freshness")
            .satisfies("created_at > now() - interval '1 day'")
            .has_size(Assertion::GreaterThan(1000))
            .build()
    )
    .check(
        Check::builder("Business Rules")
            .has_min("revenue", Assertion::GreaterThan(0.0))
            .has_mean("conversion_rate", Assertion::Between(0.01, 0.10))
            .has_correlation("ad_spend", "revenue", Assertion::GreaterThan(0.5))
            .build()
    )
    .build();

// Runs all checks in a single optimized pass
let report = suite.run(&ctx).await?;
```
```rust
use term_guard::analyzers::{IncrementalAnalysisRunner, FilesystemStateStore};

// Initialize with state persistence
let store = FilesystemStateStore::new("./metrics_state");
let runner = IncrementalAnalysisRunner::new(store);

// Process daily partitions incrementally
let state = runner.analyze_partition(
    &ctx,
    "2025-09-30", // Today's partition
    vec![analyzer],
).await?;
// Only new data is processed, previous results are reused!
```
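Running the same analysis the next day picks up where the last run left off; a sketch continuing the snippet above (same runner and analyzer list):

```rust
// Tomorrow's run scans only the new partition; earlier results are
// loaded from ./metrics_state and merged, as described above.
let state = runner.analyze_partition(
    &ctx,
    "2025-10-01", // Tomorrow's partition
    vec![analyzer],
).await?;
```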
```rust
use term_guard::analyzers::{KllSketchAnalyzer, CorrelationAnalyzer};

// Approximate quantiles with minimal memory
let kll = KllSketchAnalyzer::new("response_time")
    .with_k(256) // Higher k = better accuracy
    .with_quantiles(vec![0.5, 0.95, 0.99]);

// Detect relationships between metrics
let correlation = CorrelationAnalyzer::new("ad_spend", "revenue")
    .with_method(CorrelationMethod::Spearman); // Handles non-linear relationships

let results = runner.run_analyzers(vec![kll, correlation]).await?;
```
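Reading the computed quantiles back out depends on the results type; a hypothetical sketch where the `quantile` accessor is an assumption for illustration only:

```rust
// Hypothetical accessor: fetch the approximate p95 for a column.
if let Some(p95) = results.quantile("response_time", 0.95) {
    println!("approx p95 response time: {p95}");
}
```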
```rust
// Validate relationships across tables with the fluent API
let suite = ValidationSuite::builder("Cross-table integrity")
    .check(
        Check::builder("Referential integrity")
            .foreign_key("orders.customer_id", "customers.id")
            .temporal_consistency("orders", "created_at", "updated_at")
            .build()
    )
    .build();
```
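Cross-table checks need both tables registered on the same context; for example, reusing the CSV registration from the quick start (the file names here are placeholders):

```rust
// Register both sides of the relationship before running the suite.
let ctx = SessionContext::new();
ctx.register_csv("orders", "orders.csv", CsvReadOptions::new()).await?;
ctx.register_csv("customers", "customers.csv", CsvReadOptions::new()).await?;

let report = suite.run(&ctx).await?;
```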
```rust
// New format validators, including SSN detection
let check = Check::builder("PII Protection")
    .contains_ssn("ssn_field")        // Validates SSN format
    .contains_credit_card("cc_field") // Credit card detection
    .contains_email("email_field")    // Email validation
    .build();
```
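A standalone check runs the same way as any other, wrapped in a suite; for example, using the builder API shown earlier:

```rust
// Wrap the PII check in a suite and run it against the registered data.
let report = ValidationSuite::builder("PII Audit")
    .check(check)
    .build()
    .run(&ctx)
    .await?;
println!("{}", report);
```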
```rust
use term_guard::analyzers::{AnomalyDetector, RelativeRateOfChangeStrategy};

// Detect sudden metric changes
let detector = AnomalyDetector::new()
    .with_strategy(RelativeRateOfChangeStrategy::new()
        .max_rate_increase(0.5) // Flag 50%+ increases
        .max_rate_decrease(0.3) // Flag 30%+ decreases
    );

let anomalies = detector.detect(&historical_metrics, &current_metric)?;
```
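What to do with a flagged anomaly is up to the caller; a sketch that simply logs each one, assuming only that the returned collection is iterable and its items implement `Debug`:

```rust
// Illustrative: surface every detected anomaly to the operator.
for anomaly in &anomalies {
    eprintln!("anomaly detected: {anomaly:?}");
}
```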
```rust
use term_guard::analyzers::GroupedCompletenessAnalyzer;

// Analyze data quality by segment
let analyzer = GroupedCompletenessAnalyzer::new()
    .group_by(vec!["region", "product_category"])
    .analyze_column("revenue");

// Get metrics for each group combination
let results = analyzer.compute(&ctx).await?;
// e.g., completeness for region=US & category=Electronics
```
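A hypothetical pass over the per-group results to flag weak segments; the `key` and `completeness` fields are illustrative assumptions, not the confirmed result type:

```rust
// Hypothetical fields: alert on segments below 95% revenue completeness.
for group in &results {
    if group.completeness < 0.95 {
        eprintln!(
            "low completeness in {:?}: {:.1}%",
            group.key,
            group.completeness * 100.0
        );
    }
}
```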
Don't know what to validate? Term can analyze your data and suggest constraints automatically:
```rust
use term_guard::analyzers::{ColumnProfiler, SuggestionEngine};
use term_guard::analyzers::{CompletenessRule, UniquenessRule, PatternRule, RangeRule};

// Profile your data
let profiler = ColumnProfiler::new();
let profile = profiler.profile_column(&ctx, "users", "email").await?;

// Get intelligent suggestions
let engine = SuggestionEngine::new()
    .add_rule(Box::new(CompletenessRule::new())) // Suggests null checks
    .add_rule(Box::new(UniquenessRule::new()))   // Finds potential keys
    .add_rule(Box::new(PatternRule::new()))      // Detects email/phone patterns
    .add_rule(Box::new(RangeRule::new()))        // Recommends numeric bounds
    .confidence_threshold(0.8);

let suggestions = engine.suggest_constraints(&profile);

// Example output:
// ✓ Suggested: is_complete (confidence: 0.90)
//   Rationale: Column is 99.8% complete, suggesting completeness constraint
// ✓ Suggested: is_unique (confidence: 0.95)
//   Rationale: Column has 99.9% unique values, suggesting uniqueness constraint
// ✓ Suggested: matches_email_pattern (confidence: 0.85)
//   Rationale: Sample values suggest email format
```
Term analyzes your actual data patterns to recommend the most relevant quality checks!
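Treat suggestions as candidates for review rather than rules to apply blindly. A sketch of that triage step, assuming each suggestion carries a name and a confidence score (illustrative accessors, not the confirmed type):

```rust
// Illustrative: shortlist high-confidence suggestions for human review.
for s in &suggestions {
    if s.confidence >= 0.9 {
        println!("candidate constraint: {} ({:.2})", s.name, s.confidence);
    }
}
```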
📊 Data Quality | 📈 Statistical | 🛡️ Security | 🚀 Performance
Dataset: 1M rows, 20 constraints

- Without optimizer: 3.2s (20 full scans)
- With Term: 0.21s (2 optimized scans)
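Figures like these vary with hardware and data layout, so it is worth measuring on your own dataset; a minimal timing harness using only the standard library:

```rust
use std::time::Instant;

// Time an end-to-end suite run on your own data.
let start = Instant::now();
let report = suite.run(&ctx).await?;
println!("{report}");
println!("validated in {:?}", start.elapsed());
```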
v0.0.2 Performance Improvements:
- 30-50% faster CI/CD with cargo-nextest
- Memory-efficient KLL sketches for quantile computation
- SQL window functions for correlation analysis
- Cached SessionContext for test speedup
- Comprehensive benchmark suite for regression detection
```toml
[dependencies]
term-guard = "0.0.2"
tokio = { version = "1", features = ["full"] }

# Or, to enable optional features, replace the term-guard line above with:
term-guard = { version = "0.0.2", features = ["cloud-storage"] } # S3, GCS, Azure support
```
Check out the examples/ directory for real-world scenarios:
- basic_validation.rs - Simple CSV validation
- cloud_storage_example.rs - Validate S3/GCS data
- telemetry_example.rs - Production monitoring
- tpc_h_validation.rs - Complex business rules
- incremental_analysis.rs - Incremental computation
- anomaly_detection_strategy.rs - Anomaly detection
- grouped_metrics.rs - Segment-level analysis

Our documentation is organized using the Diátaxis framework.
We love contributions! Term is built by the community, for the community.
```bash
# Get started in 3 steps
git clone https://github.com/withterm/term.git
cd term
cargo test
```
Term is MIT licensed. See LICENSE for details.
Term stands on the shoulders of giants, notably Deequ's validation model and the Apache DataFusion query engine.
Ready to bulletproof your data pipelines?
⚡ Get Started • 📖 Read the Docs • 💬 Join Community