| Crates.io | data-doctor-core |
| lib.rs | data-doctor-core |
| version | 1.0.4 |
| created_at | 2025-12-08 15:26:07.721933+00 |
| updated_at | 2025-12-21 07:33:47.121673+00 |
| description | A powerful data validation and cleaning tool for JSON and CSV files |
| homepage | |
| repository | https://github.com/jeevanms003/data-doctor |
| max_upload_size | |
| id | 1973766 |
| size | 106,444 |
DataDoctor Core is the intelligent engine behind the DataDoctor toolkit. It is a high-performance Rust library designed to validate, diagnose, and automatically fix common data quality issues in JSON and CSV datasets.
Unlike simple validators that just say "Error at line 5", DataDoctor Core attempts to understand the error and repair it using a combination of heuristic parsing, token stream analysis, and rule-based correction.
DataDoctor Core operates using two main strategies:
For JSON data, we don't just use a standard parser (which fails immediately on errors). Instead, we implement a custom, fault-tolerant token stream analyzer that can recover from structural problems such as mismatched braces `{}` and brackets `[]`.

For CSV files, the engine handles the complexities of delimiters and column alignment:

- Auto-detects the delimiter: `,`, `;`, `\t`, or `|`.
- Coerces values like `true`/`false`, validates emails, and normalizes headers.
- Works on `Read` streams, so large files can be processed incrementally.

Add this to your `Cargo.toml`:
[dependencies]
data-doctor-core = "1.0"
Use JsonValidator to fix broken JSON strings.
use data_doctor_core::json::JsonValidator;
use data_doctor_core::ValidationOptions;
fn main() {
    // Broken JSON: trailing comma, single quotes, unquoted key
    let bad_json = r#"{ name: 'John Doe', age: 30, }"#;

    let mut options = ValidationOptions::default();
    options.auto_fix = true; // Enable the repair engine

    let validator = JsonValidator::new();
    let (fixed_json, result) = validator.validate_and_fix(bad_json, &options);

    if result.success {
        println!("Fixed: {}", fixed_json);
        // Output: { "name": "John Doe", "age": 30 }
    }
}
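To give a feel for the rule-based correction described above, here is a minimal, illustrative sketch of one repair rule: stripping trailing commas before a closing brace or bracket. This is not the crate's internal algorithm, just a standalone demonstration of the idea (it also ignores string contents for brevity, which a real repair engine must not).

```rust
// Illustrative only: drop a comma when the next significant character
// closes an object or array. NOT data-doctor-core's actual implementation.
fn strip_trailing_commas(input: &str) -> String {
    let chars: Vec<char> = input.chars().collect();
    let mut out = String::with_capacity(input.len());
    for (i, &c) in chars.iter().enumerate() {
        if c == ',' {
            // Look ahead past whitespace for a closing '}' or ']'.
            let next = chars[i + 1..].iter().find(|ch| !ch.is_whitespace());
            if matches!(next, Some('}') | Some(']')) {
                continue; // skip the trailing comma
            }
        }
        out.push(c);
    }
    out
}

fn main() {
    let bad = r#"{ "age": 30, }"#;
    println!("{}", strip_trailing_commas(bad)); // prints: { "age": 30 }
}
```

A production repair pass would additionally track whether the scanner is inside a string literal, so commas in values are never touched.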
Validate large CSV files efficiently using validate_csv_stream.
use data_doctor_core::{validate_csv_stream, ValidationOptions};
use std::fs::File;
use std::io::BufReader;
fn main() -> std::io::Result<()> {
    let file = File::open("data.csv")?;
    let reader = BufReader::new(file);

    let options = ValidationOptions {
        csv_delimiter: b',',
        max_errors: 100, // Stop after 100 errors
        auto_fix: false, // Just validate, don't fix
        ..Default::default()
    };

    let result = validate_csv_stream(reader, &options);

    println!("processed {} records", result.stats.total_records);
    println!("found {} invalid records", result.stats.invalid_records);

    // Inspect specific issues
    for issue in result.issues {
        println!("[{}] Row {}: {}", issue.severity, issue.row.unwrap_or(0), issue.message);
    }

    Ok(())
}
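The delimiter auto-detection mentioned earlier can be pictured with a small heuristic: try each candidate and keep the one that appears a nonzero, consistent number of times per line. This sketch is illustrative only, not the crate's actual detection logic.

```rust
// Illustrative heuristic, NOT data-doctor-core's internals: a delimiter is
// plausible if every sampled line contains the same nonzero count of it.
fn detect_delimiter(sample: &str) -> Option<u8> {
    let candidates = [b',', b';', b'\t', b'|'];
    let lines: Vec<&str> = sample.lines().take(10).collect();
    candidates.iter().copied().find(|&d| {
        let counts: Vec<usize> = lines
            .iter()
            .map(|l| l.bytes().filter(|&b| b == d).count())
            .collect();
        counts.first().map_or(false, |&c| c > 0)
            && counts.windows(2).all(|w| w[0] == w[1])
    })
}

fn main() {
    let sample = "name;age\nAlice;30\nBob;25\n";
    println!("{:?}", detect_delimiter(sample)); // prints: Some(59), i.e. b';'
}
```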
You can define a Schema to ensure data meets specific requirements.
use data_doctor_core::schema::{Schema, FieldSchema, DataType, Constraint};
use data_doctor_core::ValidationOptions;
fn main() {
    let mut schema = Schema::new("user_profile");

    // Define fields
    schema.add_field(FieldSchema::new("email", DataType::Email)
        .add_constraint(Constraint::Required));
    schema.add_field(FieldSchema::new("age", DataType::Integer));

    let mut options = ValidationOptions::default();
    options.schema = Some(schema);

    // Now validate your data against this schema...
}
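To illustrate the kind of check a field type like `DataType::Email` implies, here is a minimal structural test written as plain Rust. The function below is hypothetical and not part of the crate's API; real email validation is considerably stricter.

```rust
// Hypothetical helper, not a data-doctor-core API: a minimal structural
// check of the sort an email-typed field might perform.
fn looks_like_email(value: &str) -> bool {
    // One '@' with a non-empty local part and a dotted domain.
    match value.split_once('@') {
        Some((local, domain)) => {
            !local.is_empty() && domain.contains('.') && !domain.starts_with('.')
        }
        None => false,
    }
}

fn main() {
    assert!(looks_like_email("jane@example.com"));
    assert!(!looks_like_email("not-an-email"));
    println!("email checks passed");
}
```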
Configuration (`ValidationOptions`)

| Option | Type | Default | Description |
|---|---|---|---|
| `auto_fix` | `bool` | `false` | If true, the engine attempts to repair detected issues. |
| `max_errors` | `usize` | `0` (unlimited) | Stop processing after finding N errors (useful for large files). |
| `csv_delimiter` | `u8` | `b','` | The delimiter character for CSV parsing. |
| `schema` | `Option<Schema>` | `None` | Optional data schema for stricter validation. |
Contributions are welcome! Please check out the main repository for guidelines.
MIT License.