Crates.io | csv-sanity |
lib.rs | csv-sanity |
version | 0.1.0 |
source | src |
created_at | 2019-02-06 20:53:46.425304 |
updated_at | 2019-02-06 20:53:46.425304 |
description | Sanitize and transform large CSVs with millions of records quickly and efficiently. |
homepage | |
repository | https://github.com/mgeorgehansen/csv-sanity |
max_upload_size | |
id | 113206 |
size | 64,910 |
Preserve your sanity is a world full of malformed, poorly validated CSV files. Sanitize and transform large CSVs with millions of records quickly and efficiently.
NOTE: csv-sanity is in an alpha state and is subject to breaking changes. The ruleset file syntax in particular is likely to change in the near future. I've personally used csv-sanity on a number of projects and it has been incredibly helpful, but as with most alpha software csv-sanity is provided as-is and provides no warranty or guarantee. Use at your own risk and double check your transformed files!
The CSV format is not well-standardized and has many shortfalls when it comes to storing large numbers of records with complex data formats, but CSVs are ubiquitous in many realms as a neutral interchange format that most CRMs and database software can parse and understand.
But what happens when your CRM can only parse ISO 8601 formatted dates and the CSV you inherited has dates in another format such as the following:
id,name,signup_date
2,John Doe,11/22/2017
3,Jane Doe,11/28/2017
Or you received a CSV of people who you need to contact via a personalized email, but your contacts' names in the CSV are in ALL CAPS:
id,first_name,last_name
2,JOHN,DOE
3,JANE,DOE
Or you have a CSV that has valid values for the vast majority of records, but 1 out of every 20k records has nonsense values that cause your entire import to abort:
id,fist_name,last_name,party_registration
2,Jane,Doe,REP
3,John,Doe,DEM
345,Josh,Smith,HAHAHAHA
Or even a CSV that has a few malformed records due to unescaped commas:
id,first_name,last_name,email
2,Jane,Doe,jane@example.com
3,John,Doe,"i,don't,follow,the,rules"@example.com
These are all real problems I've encountered with CSVs over the years. If the CSV is small enough they can be corrected by hand, but for CSVs with 10k, 100k or even millions of records correcting by hand simply isn't a viable option.
csv-sanity
aims to solve the issue of sanitizing large, poorly-validated CSVs.
csv-sanity
is an executable that takes an input CSV to process and a JSON
ruleset file defining the transformation rules to apply:
csv-sanity [-r RULESET_FILE] <INPUT_FILE>
If a path to a ruleset file is not provided via the -r
option, csv-sanity
will look for a file named "ruleset.json" in the current directory.
By default, csv-sanity
outputs two files to the current directory:
output.csv, which contains the processed CSV with validated and transformed
records, and errors.csv, which contains a list of records and fields that
couldn't be processed and reasons they were rejected. The paths where the output
and error files are output can be overridden via the -o FILE_PATH
and
-e FILE_PATH
options, respectively.
Ruleset files are JSON files that define a collection of transformation rules and the fields to which they should be applied.
The following is an example ruleset JSON file:
{
"rules": [
{
"applicability": {
"Global": [],
},
"transformer": {
"None": {
"regex": "\\A(?:[:cntrl:]|\\s)*\\z"
}
},
"priority": -10
},
{
"applicability": {
"Global": [],
},
"transformer": {
"Trim": {}
},
"priority": -10
},
{
"applicability": {
"Fields": {
"field_names": [
"first_name",
"last_name"
]
}
},
"transformer": {
"Capitalize": {}
}
}
]
}
Every ruleset.json file is a JSON object with a single "rules" field with an array of rule objects.
Rules are objects with two fields:
{
"Capitalize": {}
}
Transforms string fields into Capital Case.
{
"Choice": {
"choices": [
"A",
"B",
"C"
]
}
}
Only accepts a pre-defined list of acceptable values and rejects the rest.
{
"Date": {
"input_formats": [
"%m/%d/%Y"
],
"output_formats": "%F"
}
}
Attempt to parse fields with a list of datetime formats via time::strptime. See the docs for the time crate for details on datetime formating syntax.
{
"Email": {}
}
Attempt to parse fields as email addresses, rejecting any fields that appear to be invalid email addresses.
{
"None": {
"regex": "\\A(?:[:cntrl:]|\\s)*\\z"
}
}
Replace matched fields with a blank value. Useful as a global rule for normalizing blank fields in a CSV file.
{
"Number": {}
}
Attempt to parse fields as whole integers, rejecting any fields that cannot be parsed.
{
"PhoneNumber": {}
}
Attempt to parse files as US, NANP-formatted phone numbers, transforming them
into a standard international format of +1 <area_code> <exchange_code> <subscriber_number>
.
{
"Regex": {
"regex": "\\A([A-Z])[A-Z]+\\z",
"template": "$1"
}
}
Match fields against the provided regex pattern and transform them according to the template string, replacing capture groups placeholders. See the Regex::replace in the regex crate docs for details.
{
"RegexMatch": {
"regex": "\\A[A-Z]{2,3}\\z",
"negate": false
}
}
Reject any fields that fail to match against the provided regex pattern. If
negate
is true
, the reject any fields that match the provided regex pattern
instead.
{
"Trim": {}
}
Trim leading and trailing whitespace from fields. Useful as a global rule to normalize fields and remove useless whitespace.
{
"Zipcode": {}
}
Attempt to parse fields as US zip codes in the formats "xxxxx" and "xxxxx-xxxx", rejecting any fields that fail to match that format.