hashcsv

Crates.io: hashcsv
lib.rs: hashcsv
version: 1.0.1
source: src
created_at: 2021-02-10 14:44:35.691456
updated_at: 2022-05-25 15:09:22.104824
description: Append an `id` column to each row of a CSV file, containing a UUID v5 hash of the row
homepage: https://github.com/faradayio/csv-tools/blob/main/hashcsv/README.md
repository: https://github.com/faradayio/csv-tools
max_upload_size:
id: 353245
size: 22,206
owner: Eric Kidd (emk)

README

hashcsv: Use CSV row contents to assign an ID to each row

hashcsv takes a CSV file as input and outputs the same CSV data with an id column appended. The id column contains a UUID v5 hash of the normalized row contents. The tool is written in moderately optimized Rust and should be suitable for large CSV files; it had a throughput of roughly 65 MiB/s when tested on a developer laptop.

Usage

This can be invoked as either of:

hashcsv input.csv > output.csv
hashcsv < input.csv > output.csv

If input.csv contains:

a,b,c
1,2,3
1,2,3
4,5,6

Then output.csv will contain:

a,b,c,id
1,2,3,ab37bf3a-c35c-51a9-802d-8eda9ee2f50a
1,2,3,ab37bf3a-c35c-51a9-802d-8eda9ee2f50a
4,5,6,481492ee-82c7-58b9-95ec-d92cbcd332c4

There is also an option to rename the id column. See --help for details.
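As the example shows, identical rows receive identical ids. A minimal Python sketch of the idea (the namespace UUID and the exact row normalization used by hashcsv are internal details; the ones below are assumptions for illustration only):

```python
import uuid

# Assumed namespace for illustration; hashcsv's real namespace UUID
# and normalization scheme are internal to the tool.
NAMESPACE = uuid.NAMESPACE_URL

def row_id(fields):
    # Join fields with a NUL separator so the same field contents
    # always map to the same name string, and thus the same UUID v5.
    name = "\x00".join(fields)
    return str(uuid.uuid5(NAMESPACE, name))

# Identical rows hash to identical ids; distinct rows differ.
print(row_id(["1", "2", "3"]) == row_id(["1", "2", "3"]))  # True
print(row_id(["1", "2", "3"]) == row_id(["4", "5", "6"]))  # False
```

Because the id depends only on the row contents, re-running the tool on the same input reproduces the same ids, and duplicate rows are easy to spot by their shared id.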

Limitations: Birthday problem

UUID v5 is based on a SHA-1 hash, and it preserves 122 bits of the hash output.

This means that if you hash 2^(122/2) = 2^61 ≈ 2.3×10^18 rows, you should expect roughly a 50% chance of at least one collision. That is 2.3 quintillion rows, which should be adequate for many applications. See the birthday problem for more information.
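The standard birthday approximation p ≈ 1 − exp(−n²/2N), with N = 2^122 possible values, can be checked numerically. A sketch (the 2^61 figure is the usual square-root rule of thumb; the exact 50% point is slightly higher):

```python
import math

N = 2**122  # distinct values a UUID v5 can take

def collision_probability(n):
    # Birthday approximation: P(at least one collision among n draws)
    return 1 - math.exp(-n * n / (2 * N))

# At n = 2^61 rows the approximation gives about 39%; the 50% point
# is near sqrt(2 * N * ln 2), i.e. roughly 1.18 * 2^61 rows.
print(collision_probability(2**61))            # ~0.393
print(math.sqrt(2 * N * math.log(2)) / 2**61)  # ~1.177
```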

Benchmarking

To measure throughput, build in release mode:

cargo build --release --target x86_64-unknown-linux-musl

Then use pv to measure output speed:

../target/x86_64-unknown-linux-musl/release/hashcsv test.csv | pv > /dev/null
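The test.csv input is not included in the repository; a hypothetical generator for a benchmark file (the file name and column layout here are assumptions) could look like:

```python
import csv

# Hypothetical benchmark-input generator; test.csv is not shipped
# with the repository, so the layout below is an assumption.
with open("test.csv", "w", newline="") as f:
    w = csv.writer(f)
    w.writerow(["a", "b", "c"])
    for i in range(1_000_000):
        w.writerow([i, i * 2, i * 3])
```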

To find where the hotspots are, record a profile with perf:

perf record --call-graph=lbr \
    ../target/x86_64-unknown-linux-musl/release/hashcsv test.csv > /dev/null