# csv-sanity

Preserve your sanity in a world full of malformed, poorly validated CSV files. Sanitize and transform large CSVs with millions of records quickly and efficiently.

**NOTE:** csv-sanity is in an alpha state and is subject to breaking changes. The ruleset file syntax in particular is likely to change in the near future. I've personally used csv-sanity on a number of projects and it has been incredibly helpful, but as with most alpha software, csv-sanity is provided as-is, with no warranty or guarantee. Use at your own risk and double-check your transformed files!

## Purpose

The CSV format is not well standardized and has many shortfalls when it comes to storing large numbers of records with complex data formats, but CSVs are ubiquitous in many realms as a neutral interchange format that most CRMs and database software can parse and understand.

But what happens when your CRM can only parse ISO 8601 formatted dates and the CSV you inherited has dates in another format, such as the following?

```csv
id,name,signup_date
2,John Doe,11/22/2017
3,Jane Doe,11/28/2017
```

Or you received a CSV of people you need to contact via a personalized email, but your contacts' names in the CSV are in ALL CAPS:

```csv
id,first_name,last_name
2,JOHN,DOE
3,JANE,DOE
```

Or you have a CSV with valid values for the vast majority of records, but 1 out of every 20k records has nonsense values that cause your entire import to abort:

```csv
id,first_name,last_name,party_registration
2,Jane,Doe,REP
3,John,Doe,DEM
345,Josh,Smith,HAHAHAHA
```

Or even a CSV that has a few malformed records due to unescaped commas:

```csv
id,first_name,last_name,email
2,Jane,Doe,jane@example.com
3,John,Doe,"i,don't,follow,the,rules"@example.com
```

These are all real problems I've encountered with CSVs over the years. If the CSV is small enough they can be corrected by hand, but for CSVs with 10k, 100k, or even millions of records, correcting by hand simply isn't a viable option. `csv-sanity` aims to solve the problem of sanitizing large, poorly validated CSVs.

## Usage

`csv-sanity` is an executable that takes an input CSV to process and a JSON ruleset file defining the transformation rules to apply:

```bash
csv-sanity [-r RULESET_FILE]
```

If a path to a ruleset file is not provided via the `-r` option, `csv-sanity` looks for a file named `ruleset.json` in the current directory.

By default, `csv-sanity` writes two files to the current directory: `output.csv`, which contains the processed CSV with validated and transformed records, and `errors.csv`, which lists the records and fields that couldn't be processed along with the reasons they were rejected. The output and error file paths can be overridden via the `-o FILE_PATH` and `-e FILE_PATH` options, respectively.
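For example, a typical run might look like the sketch below. The file names are placeholders, and it assumes the input CSV is supplied as a trailing positional argument (the synopsis above only shows the `-r` option, so treat that as unconfirmed):

```bash
# Read rules from contacts-rules.json instead of ./ruleset.json,
# writing sanitized records to contacts-clean.csv and rejected
# records to contacts-errors.csv. The trailing contacts.csv input
# argument is an assumption, not confirmed by the synopsis above.
csv-sanity -r contacts-rules.json -o contacts-clean.csv -e contacts-errors.csv contacts.csv
```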
## ruleset.json Syntax

Ruleset files are JSON files that define a collection of transformation rules and the fields to which they should be applied. The following is an example ruleset JSON file:

```json
{
  "rules": [
    {
      "applicability": {
        "Global": []
      },
      "transformer": {
        "None": {
          "regex": "\\A(?:[:cntrl:]|\\s)*\\z"
        }
      },
      "priority": -10
    },
    {
      "applicability": {
        "Global": []
      },
      "transformer": {
        "Trim": {}
      },
      "priority": -10
    },
    {
      "applicability": {
        "Fields": {
          "field_names": [
            "first_name",
            "last_name"
          ]
        }
      },
      "transformer": {
        "Capitalize": {}
      }
    }
  ]
}
```

Every ruleset.json file is a JSON object with a single "rules" field containing an array of rule objects. Rules are objects with two fields:

- **"applicability"**: specifies whether a rule applies globally or only to a predefined set of fields (specified as the column headers in the CSV being processed)
- **"transformer"**: a transformer object, which specifies how the applicable fields should be transformed

### Transformers

#### Capitalize

```json
{ "Capitalize": {} }
```

Transforms string fields into Capital Case.

#### Choice

```json
{
  "Choice": {
    "choices": ["A", "B", "C"]
  }
}
```

Accepts only values from a predefined list and rejects the rest.

#### Date

```json
{
  "Date": {
    "input_formats": ["%m/%d/%Y"],
    "output_formats": "%F"
  }
}
```

Attempts to parse fields with a list of datetime formats via [time::strptime](https://docs.rs/time/0.1.37/time/fn.strptime.html). See the docs for the [time](https://docs.rs/time/0.1.37/time/index.html) crate for details on datetime formatting syntax.

#### Email

```json
{ "Email": {} }
```

Attempts to parse fields as email addresses, rejecting any fields that appear to be invalid email addresses.

#### None

```json
{
  "None": {
    "regex": "\\A(?:[:cntrl:]|\\s)*\\z"
  }
}
```

Replaces matched fields with a blank value. Useful as a global rule for normalizing blank fields in a CSV file.

#### Number

```json
{ "Number": {} }
```

Attempts to parse fields as whole integers, rejecting any fields that cannot be parsed.

#### PhoneNumber

```json
{ "PhoneNumber": {} }
```

Attempts to parse fields as US, NANP-formatted phone numbers, transforming them into a standard international format prefixed with `+1`.

#### Regex

```json
{
  "Regex": {
    "regex": "\\A([A-Z])[A-Z]+\\z",
    "template": "$1"
  }
}
```

Matches fields against the provided regex pattern and transforms them according to the template string, replacing capture group placeholders. See [Regex::replace](https://docs.rs/regex/0.2.1/regex/struct.Regex.html#method.replace) in the regex crate docs for details.

#### RegexMatch

```json
{
  "RegexMatch": {
    "regex": "\\A[A-Z]{2,3}\\z",
    "negate": false
  }
}
```

Rejects any fields that fail to match the provided regex pattern. If `negate` is `true`, instead rejects any fields that do match the provided regex pattern.

#### Trim

```json
{ "Trim": {} }
```

Trims leading and trailing whitespace from fields. Useful as a global rule to normalize fields and remove useless whitespace.

#### Zipcode

```json
{ "Zipcode": {} }
```

Attempts to parse fields as US zip codes in the formats "xxxxx" and "xxxxx-xxxx", rejecting any fields that fail to match those formats.
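Putting several transformers together, the sketch below shows how a ruleset might address the date and party-registration problems from the Purpose section. It only combines the syntax documented above; the field names match those examples, and the `"IND"` choice is an illustrative assumption (JSON doesn't allow comments, so all caveats live here):

```json
{
  "rules": [
    {
      "applicability": {
        "Global": []
      },
      "transformer": {
        "Trim": {}
      },
      "priority": -10
    },
    {
      "applicability": {
        "Fields": {
          "field_names": ["signup_date"]
        }
      },
      "transformer": {
        "Date": {
          "input_formats": ["%m/%d/%Y"],
          "output_formats": "%F"
        }
      }
    },
    {
      "applicability": {
        "Fields": {
          "field_names": ["party_registration"]
        }
      },
      "transformer": {
        "Choice": {
          "choices": ["REP", "DEM", "IND"]
        }
      }
    }
  ]
}
```

With a ruleset like this, a `signup_date` of `11/22/2017` would come out as `2017-11-22`, while a record whose `party_registration` is `HAHAHAHA` would be routed to the errors file rather than aborting the entire import.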