# `hashcsv`: Use CSV row contents to assign an ID to each row

`hashcsv` takes a CSV file as input and outputs the same CSV data with an `id` column appended. The `id` column contains a UUID v5 hash of the normalized row contents.

This tool is written in moderately optimized Rust and should be suitable for large CSV files. It achieved a throughput of roughly 65 MiB/s when tested on a developer laptop.

## Usage

This can be invoked as either of:

```sh
hashcsv input.csv > output.csv
hashcsv < input.csv > output.csv
```

If `input.csv` contains:

```csv
a,b,c
1,2,3
1,2,3
4,5,6
```

Then `output.csv` will contain:

```csv
a,b,c,id
1,2,3,ab37bf3a-c35c-51a9-802d-8eda9ee2f50a
1,2,3,ab37bf3a-c35c-51a9-802d-8eda9ee2f50a
4,5,6,481492ee-82c7-58b9-95ec-d92cbcd332c4
```

There is also an option for renaming the `id` column. See `--help` for details.

## Limitations: Birthday problem

UUID v5 is based on a SHA-1 hash, and it preserves 122 bits of the hash output. This means that if you hash 2^(122/2) = 2^61 ≈ 2.3×10^18 rows, you should expect roughly a 50% chance of at least one collision. This is 2.3 _quintillion_ rows, which should be adequate for many applications. See [the birthday problem](https://en.wikipedia.org/wiki/Birthday_problem) for more information.

## Benchmarking

To measure throughput, build in release mode:

```sh
cargo build --release --target x86_64-unknown-linux-musl
```

Then use `pv` to measure output speed:

```sh
../target/x86_64-unknown-linux-musl/release/hashcsv test.csv | pv > /dev/null
```

To find where the hotspots are, record a profile with `perf`:

```sh
perf record --call-graph=lbr \
    ../target/x86_64-unknown-linux-musl/release/hashcsv test.csv > /dev/null
```
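The recorded profile (`perf.data` in the current directory) can then be browsed with the standard `perf` viewer; this is ordinary `perf` usage rather than anything specific to `hashcsv`:

```sh
perf report
```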
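## Appendix: row-hashing sketch

The exact normalization that `hashcsv` applies before hashing is internal to the tool, but the general UUID v5 approach described above can be sketched in a few lines of Rust. This is a minimal illustration, not the actual `hashcsv` source: it assumes the `csv` and `uuid` crates (the latter with its `v5` feature enabled), a hypothetical namespace UUID, and fields joined with the ASCII unit separator as the "normalized" row representation. The real tool may differ on all of these points.

```rust
use csv::ReaderBuilder;
use uuid::Uuid;

/// Hash one row's fields into a UUID v5 under the given namespace.
fn row_id(namespace: &Uuid, fields: &[&str]) -> Uuid {
    // "Normalize" by joining fields with the ASCII unit separator; the real
    // tool's normalization scheme may differ.
    let joined = fields.join("\x1f");
    Uuid::new_v5(namespace, joined.as_bytes())
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Hypothetical namespace; hashcsv presumably uses its own constant.
    let namespace = Uuid::NAMESPACE_OID;

    let mut reader = ReaderBuilder::new().from_reader(std::io::stdin());
    let mut writer = csv::Writer::from_writer(std::io::stdout());

    // Copy the header row and append the new `id` column.
    let mut header = reader.headers()?.clone();
    header.push_field("id");
    writer.write_record(&header)?;

    // Hash each data row and append its UUID as the `id` field.
    for result in reader.records() {
        let record = result?;
        let fields: Vec<&str> = record.iter().collect();
        let id = row_id(&namespace, &fields);

        let mut out = record.clone();
        out.push_field(&id.to_string());
        writer.write_record(&out)?;
    }
    writer.flush()?;
    Ok(())
}
```

Because the ID depends only on the row contents, identical rows hash to identical UUIDs, which is why the two `1,2,3` rows in the usage example above share the same `id`.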
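## Appendix: where the 2^61 figure comes from

The 50%-collision estimate in the Limitations section is the standard birthday-bound approximation for hashing `n` rows into `N = 2^122` possible IDs:

```math
P(\text{collision}) \approx 1 - e^{-n^2 / (2N)}, \qquad
P = \tfrac{1}{2} \;\Rightarrow\; n \approx \sqrt{2 \ln 2 \cdot N} \approx 1.18 \times 2^{61}
```

So 2^61 is a slightly conservative round-off of the exact 50% point; either way, the threshold is on the order of 10^18 rows.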