csv_polars_cleaner

Crates.io: csv_polars_cleaner
lib.rs: csv_polars_cleaner
Version: 0.3.0
Created at: 2025-05-01 07:24:27.107737+00
Updated at: 2025-05-01 09:08:28.871009+00
Description: A robust Rust library for extracting and cleaning tabular data from messy CSV files using Polars.
Homepage: https://github.com/sanjaysingh13/csv_polars_cleaner
Repository: https://github.com/sanjaysingh13/csv_polars_cleaner
ID: 1655931
Size: 79,341
Author: Sanjay Singh (sanjaysingh13)
Documentation: https://docs.rs/csv_polars_cleaner

README

csv_polars_cleaner

A robust Rust library for extracting and cleaning tabular data from messy CSV files using the Polars DataFrame engine.

Objective

  • To reliably parse CSV files that may contain metadata, comments, empty lines, or other non-tabular content before or after the actual data table.
  • To automatically detect the start and end of the true data region using statistical heuristics (mode of column counts).

Functionality

  • Skips metadata, comments, and blank lines to find the real table header and data.
  • Uses the most frequent column count to infer the bounds of the data block.
  • Returns a Polars DataFrame for further analysis or processing.
  • Provides clear error messages for malformed or unsupported files.
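
The mode-of-column-counts heuristic described above can be sketched roughly as follows. This is an illustrative sketch only, not the crate's actual implementation: the names `field_count` and `detect_data_region` are hypothetical, fields are counted with a naive split (quoted fields that contain the delimiter are not handled), and ties between equally frequent counts are broken in favor of the wider row.

```rust
use std::collections::HashMap;

/// Count the fields on a line by splitting on the delimiter.
/// Simplification: quoted fields containing the delimiter are not handled.
fn field_count(line: &str, delimiter: char) -> usize {
    if line.trim().is_empty() {
        0
    } else {
        line.split(delimiter).count()
    }
}

/// Return the inclusive (first, last) line indices of the lines whose field
/// count equals the most frequent field count, or None if no line qualifies.
fn detect_data_region(lines: &[&str], delimiter: char) -> Option<(usize, usize)> {
    let counts: Vec<usize> = lines.iter().map(|l| field_count(l, delimiter)).collect();

    // Tally field counts, ignoring blank lines and single-field lines,
    // which are assumed to be metadata or comments.
    let mut freq: HashMap<usize, usize> = HashMap::new();
    for &c in counts.iter().filter(|&&c| c > 1) {
        *freq.entry(c).or_insert(0) += 1;
    }

    // The mode of the field counts; ties go to the larger field count.
    let (&mode, _) = freq.iter().max_by_key(|&(&count, &n)| (n, count))?;

    // The first matching line is treated as the header, the last as the final row.
    let first = counts.iter().position(|&c| c == mode)?;
    let last = counts.iter().rposition(|&c| c == mode)?;
    Some((first, last))
}
```

Given a file with two metadata lines, a blank line, a three-column header with two data rows, and a trailing summary line, this sketch would report the header and the two data rows as the data region and ignore everything else.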

Limitations

  • Only supports single-table CSVs (not multi-table or hierarchical data).
  • Assumes the delimiter is consistent within the data region (default: ,).
  • Does not attempt to infer or repair rows with inconsistent column counts within the main data region.
  • Metadata and comment lines must not contain the delimiter in a way that mimics a table row (for example, by splitting into the same number of fields as the data rows).
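
Because rows with inconsistent column counts inside the data region are not repaired, it can be useful to check for them before parsing. The helper below is a hypothetical sketch, not part of the crate's API: `inconsistent_rows` is an assumed name, and it uses the same naive delimiter split that ignores quoting.

```rust
/// Return the indices (relative to `region`) of rows whose field count
/// differs from `expected`. Uses a naive split that ignores quoting.
fn inconsistent_rows(region: &[&str], delimiter: char, expected: usize) -> Vec<usize> {
    region
        .iter()
        .enumerate()
        .filter(|(_, line)| line.split(delimiter).count() != expected)
        .map(|(i, _)| i)
        .collect()
}
```

An empty result means every row in the region has the expected width; any returned index points at a row that would need manual attention.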

Usage

Add to your Cargo.toml:

[dependencies]
csv_polars_cleaner = "<version>"

Example usage:

use csv_polars_cleaner::parse_folder;

fn main() {
    // Folder containing the CSV files to parse.
    let folder = "path/to/your/folder";

    // The second argument is the delimiter, passed as a byte (here a comma).
    match parse_folder(folder, b',') {
        Ok(dfs) => {
            // `dfs` holds the parsed DataFrames.
            println!("Parsed {} files", dfs.len());
            for (i, df) in dfs.iter().enumerate() {
                println!("\nFile {}:", i + 1);
                println!("Headers: {:?}", df.get_column_names());
                println!("Number of rows: {}", df.height());
            }
        }
        Err(e) => {
            // Report the failure on stderr.
            eprintln!("Failed to parse folder: {:?}", e);
        }
    }
}

Command-line Usage

This crate includes a simple CLI for quickly checking CSV parsing on your system. To get started, clone this repository:

git clone https://github.com/sanjaysingh13/csv_polars_cleaner.git
cd csv_polars_cleaner

Then run the CLI against a folder:

cargo run -- path/to/your/folder

This will recursively parse all .csv files in the specified folder and its subfolders.
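
For readers who want a picture of what a recursive scan for .csv files involves, here is a minimal sketch using only the standard library. It is not taken from the crate's source: `collect_csv_files` is a hypothetical name, and matching the extension case-insensitively is an assumption rather than documented behavior.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Recursively collect paths under `dir` that have a `.csv` extension.
/// Assumption: the extension is matched case-insensitively.
fn collect_csv_files(dir: &Path) -> io::Result<Vec<PathBuf>> {
    let mut found = Vec::new();
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        if path.is_dir() {
            // Descend into subfolders.
            found.extend(collect_csv_files(&path)?);
        } else if path
            .extension()
            .and_then(|ext| ext.to_str())
            .map_or(false, |ext| ext.eq_ignore_ascii_case("csv"))
        {
            found.push(path);
        }
    }
    // Sort so the result order does not depend on the filesystem.
    found.sort();
    Ok(found)
}
```

Each collected path could then be handed to whatever parsing routine is in use; I/O errors from unreadable directories are returned to the caller rather than skipped.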

For more details, see the source code.

View API Documentation (GitHub Pages)
