data-doctor-core

Crate: data-doctor-core v1.0.4
Description: A powerful data validation and cleaning tool for JSON and CSV files
Repository: https://github.com/jeevanms003/data-doctor
Author: Jeevan M Swamy (jeevanms003)
Published: 2025-12-08 · Updated: 2025-12-21 · Size: 106,444 bytes

README

DataDoctor Core 🩺


DataDoctor Core is the intelligent engine behind the DataDoctor toolkit. It is a high-performance Rust library designed to validate, diagnose, and automatically fix common data quality issues in JSON and CSV datasets.

Unlike simple validators that just say "Error at line 5", DataDoctor Core attempts to understand the error and repair it using a combination of heuristic parsing, token stream analysis, and rule-based correction.


🧠 How It Works

DataDoctor Core operates using two main strategies:

1. The JSON Repair Engine

For JSON data, we don't just use a standard parser (which fails immediately on errors). Instead, we implement a custom, fault-tolerant token stream analyzer that can:

  • Lookahead/Lookbehind: To detect trailing commas or missing commas.
  • Context Awareness: To know if a quote is missing from a key or a value.
  • Structural Repair: To balance unclosed braces {} and brackets [].
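
As an illustration of the lookahead idea, a single repair pass for trailing commas might look like the sketch below. This is not the crate's internal code: the real engine works on a token stream, and `strip_trailing_commas` is a hypothetical name for a simplified, character-level version that ignores escaped backslashes before quotes.

```rust
/// Simplified sketch: remove trailing commas before `}` or `]`,
/// skipping anything inside string literals.
fn strip_trailing_commas(input: &str) -> String {
    let chars: Vec<char> = input.chars().collect();
    let mut out = String::with_capacity(input.len());
    let mut in_string = false;
    let mut i = 0;
    while i < chars.len() {
        let c = chars[i];
        match c {
            // Toggle string state on unescaped quotes.
            '"' if !(i > 0 && chars[i - 1] == '\\') => {
                in_string = !in_string;
                out.push(c);
            }
            ',' if !in_string => {
                // Lookahead: is the next non-whitespace char a closer?
                let mut j = i + 1;
                while j < chars.len() && chars[j].is_whitespace() {
                    j += 1;
                }
                if !(j < chars.len() && (chars[j] == '}' || chars[j] == ']')) {
                    out.push(c); // keep ordinary commas, drop trailing ones
                }
            }
            _ => out.push(c),
        }
        i += 1;
    }
    out
}

fn main() {
    let fixed = strip_trailing_commas(r#"{"a": [1, 2, 3,], "b": 4,}"#);
    println!("{}", fixed); // prints {"a": [1, 2, 3], "b": 4}
}
```

Commas inside string values are untouched because the scanner tracks whether it is inside a literal, which is the same context-awareness the real token analyzer needs.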

2. The CSV Normalizer

For CSV files, the engine handles the complexities of delimiters and column alignment:

  • Delimiter Detection: Statistical analysis of the first few lines to guess whether the delimiter is `,`, `;`, `\t`, or `|`.
  • Column Padding: Auto-fills missing fields with empty values to preserve row structure.
  • Type Coercion: Smartly converts "Yes"/"No" to true/false, validates emails, and normalizes headers.
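
The delimiter-detection step can be pictured with the following sketch. It is an assumption about how such statistics might be gathered, not the crate's actual algorithm: `guess_delimiter` is a hypothetical helper that counts each candidate in a few sample lines and prefers the candidate whose count is nonzero and identical on every line.

```rust
/// Illustrative delimiter guesser: a candidate wins if it appears
/// the same nonzero number of times on each sampled line.
fn guess_delimiter(sample: &str) -> u8 {
    let candidates = [b',', b';', b'\t', b'|'];
    let lines: Vec<&str> = sample.lines().take(5).collect();
    let mut best = (b',', 0usize); // fall back to comma
    for &cand in &candidates {
        let counts: Vec<usize> = lines
            .iter()
            .map(|l| l.bytes().filter(|&b| b == cand).count())
            .collect();
        // Consistent: every sampled line has the same nonzero count.
        let consistent = counts
            .first()
            .map_or(false, |&c| c > 0 && counts.iter().all(|&x| x == c));
        if consistent && counts[0] > best.1 {
            best = (cand, counts[0]);
        }
    }
    best.0
}

fn main() {
    let sample = "id;name;age\n1;Ada;36\n2;Bob;41\n";
    println!("guessed: {}", guess_delimiter(sample) as char); // prints guessed: ;
}
```

Requiring a consistent per-line count is what makes the guess robust: a semicolon that appears once inside a single free-text field will not outvote the true delimiter.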

✨ Features

  • Robust Validation: Detailed error reporting with row/column locations and specific error codes.
  • Auto-Fixing:
    • JSON: Trailing commas, missing quotes, single quotes -> double quotes, unclosed brackets.
    • CSV: Padding missing columns, trimming extra columns, boolean normalization, whitespace trimming.
  • Schema Validation: Define optional schemas to enforce data types (Integer, Email, URL, etc.) and required fields.
  • Streaming Architecture: Designed to handle large files efficiently using Read streams.
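
The boolean normalization listed under Auto-Fixing could look like this minimal sketch (`normalize_bool` is a hypothetical name, not part of the crate's public API):

```rust
/// Map common truthy/falsy spellings onto booleans;
/// return None when the field is not boolean-like.
fn normalize_bool(field: &str) -> Option<bool> {
    match field.trim().to_ascii_lowercase().as_str() {
        "yes" | "y" | "true" | "1" => Some(true),
        "no" | "n" | "false" | "0" => Some(false),
        _ => None, // leave non-boolean fields untouched
    }
}

fn main() {
    assert_eq!(normalize_bool(" Yes "), Some(true));
    assert_eq!(normalize_bool("no"), Some(false));
    assert_eq!(normalize_bool("maybe"), None);
    println!("ok");
}
```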

📦 Installation

Add this to your Cargo.toml:

[dependencies]
data-doctor-core = "1.0"

📖 Usage Guide

1. Basic JSON Validation & Fixing

Use JsonValidator to fix broken JSON strings.

use data_doctor_core::json::JsonValidator;
use data_doctor_core::ValidationOptions;

fn main() {
    // Broken JSON: Trailing comma, single quotes, unquoted key
    let bad_json = r#"{ name: 'John Doe', age: 30, }"#;

    let mut options = ValidationOptions::default();
    options.auto_fix = true; // Enable the repair engine

    let validator = JsonValidator::new();
    let (fixed_json, result) = validator.validate_and_fix(bad_json, &options);

    if result.success {
        println!("Fixed: {}", fixed_json);
        // Output: { "name": "John Doe", "age": 30 }
    }
}

2. Streaming CSV Validation

Validate large CSV files efficiently using validate_csv_stream.

use data_doctor_core::{validate_csv_stream, ValidationOptions};
use std::fs::File;
use std::io::BufReader;

fn main() -> std::io::Result<()> {
    let file = File::open("data.csv")?;
    let reader = BufReader::new(file);

    let options = ValidationOptions {
        csv_delimiter: b',',
        max_errors: 100, // Stop after 100 errors
        auto_fix: false, // Just validate, don't fix
        ..Default::default()
    };

    let result = validate_csv_stream(reader, &options);

    println!("Processed {} records", result.stats.total_records);
    println!("Found {} invalid records", result.stats.invalid_records);
    
    // Inspect specific issues
    for issue in result.issues {
        println!("[{}] Row {}: {}", issue.severity, issue.row.unwrap_or(0), issue.message);
    }

    Ok(())
}

3. Enforcing a Schema

You can define a Schema to ensure data meets specific requirements.

use data_doctor_core::schema::{Schema, FieldSchema, DataType, Constraint};
use data_doctor_core::ValidationOptions;

fn main() {
    let mut schema = Schema::new("user_profile");
    
    // Define fields
    schema.add_field(FieldSchema::new("email", DataType::Email)
        .add_constraint(Constraint::Required));
        
    schema.add_field(FieldSchema::new("age", DataType::Integer));

    let mut options = ValidationOptions::default();
    options.schema = Some(schema);

    // Now validate your data against this schema...
}

⚙️ Configuration (ValidationOptions)

| Option | Type | Default | Description |
|---|---|---|---|
| `auto_fix` | `bool` | `false` | If `true`, the engine attempts to repair detected issues. |
| `max_errors` | `usize` | `0` (unlimited) | Stop processing after finding N errors (useful for large files). |
| `csv_delimiter` | `u8` | `b','` | The delimiter character for CSV parsing. |
| `schema` | `Option<Schema>` | `None` | Optional data schema for stricter validation. |

🤝 Contributing

Contributions are welcome! Please check out the main repository for guidelines.

📄 License

MIT License.
