Crates.io | scrubcsv |
lib.rs | scrubcsv |
version | 1.0.0 |
source | src |
created_at | 2016-12-22 21:42:51.794284 |
updated_at | 2022-05-25 13:26:56.350756 |
description | Remove bad lines from large CSV files and normalize the rest |
homepage | https://github.com/faradayio/scrubcsv |
repository | https://github.com/faradayio/scrubcsv |
id | 7729 |
size | 46,765 |
This is a CSV cleaning tool based on BurntSushi's excellent csv library. It's
intended to be used for cleaning up and normalizing large data sets before
feeding them to other CSV parsers, at the cost of discarding the occasional
row. This program may further mangle syntactically-invalid CSV data!
See below for details.
To install, first install Rust if you haven't already:
curl https://sh.rustup.rs -sSf | sh
Then install scrubcsv using Cargo:
cargo install scrubcsv
Run it:
$ scrubcsv giant.csv > scrubbed.csv
3000001 rows (1 bad) in 51.58 seconds, 72.23 MiB/sec
For more options, run:
scrubcsv --help
We assume that, given hundreds of gigabytes of CSV from many sources, many files will contain a few unparsable lines.
Lines of the following form:
Name,Phone
"Robert "Bob" Smith",(202) 555-1212
...are invalid according to RFC 4180 because the quotes around "Bob" are
not escaped. The creator of the file probably intended to write:
Name,Phone
"Robert ""Bob"" Smith",(202) 555-1212
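For reference, the RFC 4180 escaping rule behind that intended line can be sketched in a few lines of Rust. This is only an illustration of the rule, not code from scrubcsv; the helper name quote_field is hypothetical.

```rust
/// Quote one CSV field the way RFC 4180 describes: wrap the field in
/// double quotes and write every embedded double quote twice.
/// NOTE: `quote_field` is a hypothetical helper for illustration only;
/// it is not part of scrubcsv or the csv crate.
fn quote_field(field: &str) -> String {
    let mut out = String::with_capacity(field.len() + 2);
    out.push('"');
    for ch in field.chars() {
        if ch == '"' {
            // An embedded quote is escaped by doubling it.
            out.push('"');
        }
        out.push(ch);
    }
    out.push('"');
    out
}
```

Calling quote_field on the name Robert "Bob" Smith yields the quoted field shown in the intended line above.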
scrubcsv will currently output this as:
Name,Phone
"Robert Bob"" Smith""",(202) 555-1212
If the resulting line has the wrong number of columns, it will be discarded. The precise details of cleanup and discarding are subject to change. The goal is to preserve data in valid CSV files, and to make a best effort to salvage or discard records that can't be parsed without being too picky about the details.
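The column-count rule can be pictured with a small sketch. It assumes that the first row (the header) defines the expected number of columns, which is an assumption made for illustration; the real cleanup and discarding details live in scrubcsv itself and are subject to change. The function name drop_ragged_rows is hypothetical.

```rust
/// Keep every row whose field count matches the first row's, and report
/// how many rows were discarded.
/// ASSUMPTION: the first row is treated as the header and sets the
/// expected column count. `drop_ragged_rows` is a hypothetical name and
/// is not taken from scrubcsv.
fn drop_ragged_rows(rows: Vec<Vec<String>>) -> (Vec<Vec<String>>, usize) {
    if rows.is_empty() {
        return (rows, 0);
    }
    let expected = rows[0].len();
    let total = rows.len();
    // Rows with too few or too many fields are dropped, not repaired.
    let kept: Vec<Vec<String>> = rows
        .into_iter()
        .filter(|row| row.len() == expected)
        .collect();
    let bad = total - kept.len();
    (kept, bad)
}
```

The second return value plays the same role as the "(1 bad)" count in the summary line that scrubcsv prints.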
This is designed to be relatively fast. For comparison purposes, on a particular laptop:
cat /dev/zero | pv > /dev/null shows a throughput of about 5 GB/s.
scrubcsv could reach about 3.5 GB/s.
The csv parser can reach roughly 235 MB/s in zero-copy mode.
scrubcsv hits 49 to 125 MB/s.
Unfortunately, we can't really use csv's zero-copy mode because we need
to see an entire row at once to decide whether or not it's valid before
deciding to output it. We could, I suppose, memmove each field as we see
it into an existing buffer to avoid malloc overhead (which is almost
certainly the bottleneck here), but that would require more code. Still,
file an issue if performance is a problem. We could probably make this
maybe two to four times faster (and it would be fun to optimize).
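As a rough picture of the buffer-reuse idea, the sketch below stores a whole row in one byte buffer plus a list of field end offsets, and clears both between rows so their allocations are reused. The type RowBuf and its methods are hypothetical names for illustration; this is not how scrubcsv is implemented today, and it has not been benchmarked.

```rust
/// One row held in a single reusable byte buffer.
/// NOTE: `RowBuf` is a hypothetical type sketching the idea of copying
/// fields into an existing buffer; it is not scrubcsv's actual code.
struct RowBuf {
    data: Vec<u8>,    // bytes of all fields, back to back
    ends: Vec<usize>, // end offset of each field within `data`
}

impl RowBuf {
    fn new() -> Self {
        RowBuf { data: Vec::new(), ends: Vec::new() }
    }

    /// Forget the previous row but keep the allocated capacity.
    fn clear(&mut self) {
        self.data.clear();
        self.ends.clear();
    }

    /// Copy one field into the shared buffer (a memmove-style copy,
    /// with no per-field allocation once capacity has grown enough).
    fn push_field(&mut self, field: &[u8]) {
        self.data.extend_from_slice(field);
        self.ends.push(self.data.len());
    }

    /// Number of fields seen so far in the current row.
    fn field_count(&self) -> usize {
        self.ends.len()
    }

    /// Borrow field `i` of the current row.
    fn field(&self, i: usize) -> &[u8] {
        let start = if i == 0 { 0 } else { self.ends[i - 1] };
        &self.data[start..self.ends[i]]
    }
}
```

With a layout like this, the whole row is available for validation before anything is written, and field_count gives the column count to check against the header.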