# `hashcsv`: Use CSV row contents to assign an ID to each row

`hashcsv` takes a CSV file as input and outputs the same CSV data with an `id` column appended. The `id` column contains a UUID v5 hash of the normalized row contents.

This tool is written in moderately optimized Rust and should be suitable for large CSV files. It achieved a throughput of roughly 65 MiB/s when tested on a developer laptop.

## Usage

This can be invoked as either of:

```sh
hashcsv input.csv > output.csv
hashcsv < input.csv > output.csv
```

If `input.csv` contains:

```csv
a,b,c
1,2,3
1,2,3
4,5,6
```

Then `output.csv` will contain:

```csv
a,b,c,id
1,2,3,ab37bf3a-c35c-51a9-802d-8eda9ee2f50a
1,2,3,ab37bf3a-c35c-51a9-802d-8eda9ee2f50a
4,5,6,481492ee-82c7-58b9-95ec-d92cbcd332c4
```

There is also an option for renaming the `id` column. See `--help` for details.

## Limitations: Birthday problem

UUID v5 is based on a SHA-1 hash, and it preserves 122 bits of the hash output. This means that if you hash 2^(122/2) = 2^61 ≈ 2.3×10^18 rows, you should expect roughly a 50% chance of at least one collision. This is 2.3 _quintillion_ rows, which should be adequate for many applications. See [the birthday problem](https://en.wikipedia.org/wiki/Birthday_problem) for more information.

## Benchmarking

To measure throughput, build in release mode:

```sh
cargo build --release --target x86_64-unknown-linux-musl
```

Then use `pv` to measure output speed:

```sh
../target/x86_64-unknown-linux-musl/release/hashcsv test.csv | pv > /dev/null
```

To find where the hotspots are, record a profile with `perf`:

```sh
perf record --call-graph=lbr \
    ../target/x86_64-unknown-linux-musl/release/hashcsv test.csv > /dev/null
```
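The recorded profile (`perf.data` in the current directory) can then be browsed with the standard `perf` viewer; this is ordinary `perf` usage rather than anything specific to `hashcsv`:

```sh
perf report
```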
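## Appendix: row-hashing sketch

The exact normalization that `hashcsv` applies before hashing is internal to the tool, but the general UUID v5 approach described above can be sketched in a few lines of Rust. This is a minimal illustration, not the actual `hashcsv` source: it assumes the `csv` and `uuid` crates (the latter with its `v5` feature enabled), a hypothetical namespace UUID, and fields joined with the ASCII unit separator as the "normalized" row representation. The real tool may differ on all of these points.

```rust
use csv::ReaderBuilder;
use uuid::Uuid;

/// Hash one row's fields into a UUID v5 under the given namespace.
fn row_id(namespace: &Uuid, fields: &[&str]) -> Uuid {
    // "Normalize" by joining fields with the ASCII unit separator; the real
    // tool's normalization scheme may differ.
    let joined = fields.join("\x1f");
    Uuid::new_v5(namespace, joined.as_bytes())
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Hypothetical namespace; hashcsv presumably uses its own constant.
    let namespace = Uuid::NAMESPACE_OID;

    let mut reader = ReaderBuilder::new().from_reader(std::io::stdin());
    let mut writer = csv::Writer::from_writer(std::io::stdout());

    // Copy the header row and append the new `id` column.
    let mut header = reader.headers()?.clone();
    header.push_field("id");
    writer.write_record(&header)?;

    // Hash each data row and append its UUID as the `id` field.
    for result in reader.records() {
        let record = result?;
        let fields: Vec<&str> = record.iter().collect();
        let id = row_id(&namespace, &fields);

        let mut out = record.clone();
        out.push_field(&id.to_string());
        writer.write_record(&out)?;
    }
    writer.flush()?;
    Ok(())
}
```

Because the ID depends only on the row contents, identical rows hash to identical UUIDs, which is why the two `1,2,3` rows in the usage example above share the same `id`.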
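## Appendix: where the 2^61 figure comes from

The 50%-collision estimate in the Limitations section is the standard birthday-bound approximation for hashing `n` rows into `N = 2^122` possible IDs:

```math
P(\text{collision}) \approx 1 - e^{-n^2 / (2N)}, \qquad
P = \tfrac{1}{2} \;\Rightarrow\; n \approx \sqrt{2 \ln 2 \cdot N} \approx 1.18 \times 2^{61}
```

So 2^61 is a slightly conservative round-off of the exact 50% point; either way, the threshold is on the order of 10^18 rows.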