data-doctor-core

Crate: data-doctor-core v1.0.4
Description: A powerful data validation and cleaning tool for JSON and CSV files
Repository: https://github.com/jeevanms003/data-doctor
Author: Jeevan M Swamy (jeevanms003)
Published: 2025-12-08 · Updated: 2025-12-21 · Size: 106,444 bytes

README

DataDoctor Core 🩺


DataDoctor Core is the intelligent engine behind the DataDoctor toolkit. It is a high-performance Rust library designed to validate, diagnose, and automatically fix common data quality issues in JSON and CSV datasets.

Unlike simple validators that just say "Error at line 5", DataDoctor Core attempts to understand the error and repair it using a combination of heuristic parsing, token stream analysis, and rule-based correction.


🧠 How It Works

DataDoctor Core operates using two main strategies:

1. The JSON Repair Engine

For JSON data, we don't just use a standard parser (which fails immediately on errors). Instead, we implement a custom, fault-tolerant token stream analyzer that can:

  • Lookahead/Lookbehind: To detect trailing commas or missing commas.
  • Context Awareness: To know if a quote is missing from a key or a value.
  • Structural Repair: To balance unclosed braces {} and brackets [].
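
As an illustration of the lookahead idea, a single repair pass for trailing commas might look like the sketch below. This is not the crate's internal code: the real engine works on a token stream, and `strip_trailing_commas` is a hypothetical name for a simplified, character-level version that ignores escaped backslashes before quotes.

```rust
/// Simplified sketch: remove trailing commas before `}` or `]`,
/// skipping anything inside string literals.
fn strip_trailing_commas(input: &str) -> String {
    let chars: Vec<char> = input.chars().collect();
    let mut out = String::with_capacity(input.len());
    let mut in_string = false;
    let mut i = 0;
    while i < chars.len() {
        let c = chars[i];
        match c {
            // Toggle string state on unescaped quotes.
            '"' if !(i > 0 && chars[i - 1] == '\\') => {
                in_string = !in_string;
                out.push(c);
            }
            ',' if !in_string => {
                // Lookahead: is the next non-whitespace char a closer?
                let mut j = i + 1;
                while j < chars.len() && chars[j].is_whitespace() {
                    j += 1;
                }
                if !(j < chars.len() && (chars[j] == '}' || chars[j] == ']')) {
                    out.push(c); // keep ordinary commas, drop trailing ones
                }
            }
            _ => out.push(c),
        }
        i += 1;
    }
    out
}

fn main() {
    let fixed = strip_trailing_commas(r#"{"a": [1, 2, 3,], "b": 4,}"#);
    println!("{}", fixed); // prints {"a": [1, 2, 3], "b": 4}
}
```

Commas inside string values are untouched because the scanner tracks whether it is inside a literal, which is the same context-awareness the real token analyzer needs.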

2. The CSV Normalizer

For CSV files, the engine handles the complexities of delimiters and column alignment:

  • Delimiter Detection: Statistical analysis of the first few lines to guess whether the delimiter is `,`, `;`, `\t`, or `|`.
  • Column Padding: Auto-fills missing fields with empty values to preserve row structure.
  • Type Coercion: Smartly converts "Yes"/"No" to true/false, validates emails, and normalizes headers.
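
The delimiter-detection step can be pictured with the following sketch. It is an assumption about how such statistics might be gathered, not the crate's actual algorithm: `guess_delimiter` is a hypothetical helper that counts each candidate in a few sample lines and prefers the candidate whose count is nonzero and identical on every line.

```rust
/// Illustrative delimiter guesser: a candidate wins if it appears
/// the same nonzero number of times on each sampled line.
fn guess_delimiter(sample: &str) -> u8 {
    let candidates = [b',', b';', b'\t', b'|'];
    let lines: Vec<&str> = sample.lines().take(5).collect();
    let mut best = (b',', 0usize); // fall back to comma
    for &cand in &candidates {
        let counts: Vec<usize> = lines
            .iter()
            .map(|l| l.bytes().filter(|&b| b == cand).count())
            .collect();
        // Consistent: every sampled line has the same nonzero count.
        let consistent = counts
            .first()
            .map_or(false, |&c| c > 0 && counts.iter().all(|&x| x == c));
        if consistent && counts[0] > best.1 {
            best = (cand, counts[0]);
        }
    }
    best.0
}

fn main() {
    let sample = "id;name;age\n1;Ada;36\n2;Bob;41\n";
    println!("guessed: {}", guess_delimiter(sample) as char); // prints guessed: ;
}
```

Requiring a consistent per-line count is what makes the guess robust: a semicolon that appears once inside a single free-text field will not outvote the true delimiter.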

✨ Features

  • Robust Validation: Detailed error reporting with row/column locations and specific error codes.
  • Auto-Fixing:
    • JSON: Trailing commas, missing quotes, single quotes -> double quotes, unclosed brackets.
    • CSV: Padding missing columns, trimming extra columns, boolean normalization, whitespace trimming.
  • Schema Validation: Define optional schemas to enforce data types (Integer, Email, URL, etc.) and required fields.
  • Streaming Architecture: Designed to handle large files efficiently using Read streams.
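
The boolean normalization listed under Auto-Fixing could look like this minimal sketch (`normalize_bool` is a hypothetical name, not part of the crate's public API):

```rust
/// Map common truthy/falsy spellings onto booleans;
/// return None when the field is not boolean-like.
fn normalize_bool(field: &str) -> Option<bool> {
    match field.trim().to_ascii_lowercase().as_str() {
        "yes" | "y" | "true" | "1" => Some(true),
        "no" | "n" | "false" | "0" => Some(false),
        _ => None, // leave non-boolean fields untouched
    }
}

fn main() {
    assert_eq!(normalize_bool(" Yes "), Some(true));
    assert_eq!(normalize_bool("no"), Some(false));
    assert_eq!(normalize_bool("maybe"), None);
    println!("ok");
}
```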

📦 Installation

Add this to your Cargo.toml:

[dependencies]
data-doctor-core = "1.0"

📖 Usage Guide

1. Basic JSON Validation & Fixing

Use JsonValidator to fix broken JSON strings.

use data_doctor_core::json::JsonValidator;
use data_doctor_core::ValidationOptions;

fn main() {
    // Broken JSON: Trailing comma, single quotes, unquoted key
    let bad_json = r#"{ name: 'John Doe', age: 30, }"#;

    let mut options = ValidationOptions::default();
    options.auto_fix = true; // Enable the repair engine

    let validator = JsonValidator::new();
    let (fixed_json, result) = validator.validate_and_fix(bad_json, &options);

    if result.success {
        println!("Fixed: {}", fixed_json);
        // Output: { "name": "John Doe", "age": 30 }
    }
}

2. Streaming CSV Validation

Validate large CSV files efficiently using validate_csv_stream.

use data_doctor_core::{validate_csv_stream, ValidationOptions};
use std::fs::File;
use std::io::BufReader;

fn main() -> std::io::Result<()> {
    let file = File::open("data.csv")?;
    let reader = BufReader::new(file);

    let options = ValidationOptions {
        csv_delimiter: b',',
        max_errors: 100, // Stop after 100 errors
        auto_fix: false, // Just validate, don't fix
        ..Default::default()
    };

    let result = validate_csv_stream(reader, &options);

    println!("Processed {} records", result.stats.total_records);
    println!("Found {} invalid records", result.stats.invalid_records);
    
    // Inspect specific issues
    for issue in result.issues {
        println!("[{}] Row {}: {}", issue.severity, issue.row.unwrap_or(0), issue.message);
    }

    Ok(())
}

3. Enforcing a Schema

You can define a Schema to ensure data meets specific requirements.

use data_doctor_core::schema::{Schema, FieldSchema, DataType, Constraint};
use data_doctor_core::ValidationOptions;

fn main() {
    let mut schema = Schema::new("user_profile");
    
    // Define fields
    schema.add_field(FieldSchema::new("email", DataType::Email)
        .add_constraint(Constraint::Required));
        
    schema.add_field(FieldSchema::new("age", DataType::Integer));

    let mut options = ValidationOptions::default();
    options.schema = Some(schema);

    // Now validate your data against this schema...
}

⚙️ Configuration (ValidationOptions)

| Option | Type | Default | Description |
|---|---|---|---|
| `auto_fix` | `bool` | `false` | If `true`, the engine attempts to repair detected issues. |
| `max_errors` | `usize` | `0` (unlimited) | Stop processing after finding N errors (useful for large files). |
| `csv_delimiter` | `u8` | `b','` | The delimiter character for CSV parsing. |
| `schema` | `Option<Schema>` | `None` | Optional data schema for stricter validation. |

🤝 Contributing

Contributions are welcome! Please check out the main repository for guidelines.

📄 License

MIT License.
