# arrow-digest [![Crates.io](https://img.shields.io/crates/v/arrow-digest.svg?style=for-the-badge)](https://crates.io/crates/arrow-digest) [![CI](https://img.shields.io/github/actions/workflow/status/sergiimk/arrow-digest/build.yaml?logo=githubactions&label=CI&logoColor=white&style=for-the-badge&branch=master)](https://github.com/sergiimk/arrow-digest/actions) [![Dependencies](https://deps.rs/repo/github/sergiimk/arrow-digest/status.svg?&style=for-the-badge)](https://deps.rs/repo/github/sergiimk/arrow-digest)

Unofficial Apache Arrow crate that aims to standardize stable hashing of structured data.

## Motivation

Today, structured data formats like Parquet are binary-unstable / non-reproducible - writing the same logical data may result in different files on the binary level depending on which writer implementation you use, and may vary with each writer version.

This crate provides a method and implementation for computing **stable hashes of structured data** (logical hash) based on the Apache Arrow in-memory format.

Benefits:
- Fast way to check for equality / equivalence of large datasets
- Two parties can compare data without needing to transfer it or reveal its contents
- A step towards **content addressability** of structured data (e.g. when storing dataset chunks in DHTs like IPFS)

## Use

```rust
use std::sync::Arc;
use arrow::array::{Int32Array, StringArray};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use arrow_digest::*;
use sha3::Sha3_256;

// Hash single array
let array = Int32Array::from(vec![1, 2, 3]);
let digest = ArrayDigestV0::<Sha3_256>::digest(&array);
println!("{:x}", digest);

// Alternatively: Use `.update(&array)` to hash multiple arrays of the same type

// Hash record batches
let schema = Arc::new(Schema::new(vec![
    Field::new("a", DataType::Int32, false),
    Field::new("b", DataType::Utf8, false),
]));

let record_batch = RecordBatch::try_new(schema, vec![
    Arc::new(Int32Array::from(vec![1, 2, 3])),
    Arc::new(StringArray::from(vec!["a", "b", "c"])),
]).unwrap();

let digest = RecordsDigestV0::<Sha3_256>::digest(&record_batch);
println!("{:x}", digest);

// Alternatively: Use `.update(&batch)` to hash multiple batches with same schema
```
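The `.update(&batch)` path mentioned above can fold several batches of the same schema into one digest which, per the design goals below, should match the digest of the same data written as a single batch. A minimal sketch of that usage - the `new(schema)` constructor and `finalize()` call are assumed names here, so check the crate docs for the exact streaming API:

```rust
use std::sync::Arc;
use arrow::array::Int32Array;
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use arrow_digest::*;
use sha3::Sha3_256;

let schema = Arc::new(Schema::new(vec![Field::new("a", DataType::Int32, false)]));

// The same logical data, split into two batches
let batch1 = RecordBatch::try_new(
    schema.clone(),
    vec![Arc::new(Int32Array::from(vec![1, 2]))],
).unwrap();
let batch2 = RecordBatch::try_new(
    schema.clone(),
    vec![Arc::new(Int32Array::from(vec![3]))],
).unwrap();

// `new` / `finalize` are assumed method names (see crate docs)
let mut hasher = RecordsDigestV0::<Sha3_256>::new(&schema);
hasher.update(&batch1);
hasher.update(&batch2);
println!("{:x}", hasher.finalize()); // same digest as hashing one 3-row batch
```

Note that this only holds for splitting: reordering records produces a different digest (see Drawbacks below).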
## Status

While we're working towards `v1` we reserve the right to break hash stability. Create an issue if you're planning to use this crate.

- [x] Schema hashing
- [x] Fixed size types
- [x] Nullability: primitive types
- [x] Binaries
- [x] Utf8 variants
- [x] Structs
- [x] Nested structs
- [x] Nullability: nested structs
- [x] Lists
- [ ] Lists of structs
- [ ] Dictionaries
- [ ] Intervals
- [ ] Unions
- [ ] Maps
- [ ] Metadata endianness check
- [ ] Better test coverage + fuzzing
- [ ] Performance: Benchmarks
- [ ] Performance: Parallelism
- [ ] Performance: Code optimization

## Design Goals

- Be reasonably fast
- Same hash no matter how many batches the input was split into
- Same hash no matter if dictionary encoding is used

## Drawbacks

- Logical hashing stops short of perfect content addressability
  - Logical hashing would need to be supported by `IPFS` and the likes, but this is a stretch as this is not a general-purpose hashing algo
  - A fully deterministic binary encoding with Parquet compatibility may be a better approach
- Proposed method is order-dependent - it will produce different hashes if records are reordered
- Boolean hashing could be more efficient

## Hashing Process

Starting from primitives and building up:

- **Endianness** - always assume little endian
- **Fixed Size Types**
  - `Int, FloatingPoint, Decimal, Date, Time, Timestamp` - hashed using their in-memory binary representation
  - `Bool` - hash the individual values as byte-sized values: `1` for `false` and `2` for `true` (keeping both distinct from the `0` byte that encodes a null)
- **Variable Size Types**
  - `Binary, LargeBinary, FixedSizeBinary, Utf8, LargeUtf8` - hash length (as `u64`) followed by the in-memory representation of the value
  - `List, LargeList, FixedSizeList` - hash length of the list (as `u64`) followed by the hash of the sub-array list according to its data type
- **Nullability**
  - Every null value is represented by a `0` (zero) byte
  - Arrays without a validity bitmap hash the same as arrays that have one with all items valid
- **Array Data**
  - *(once per hashing session)* Hash the data type according to the table below
  - Hash items sequentially using the above rules
- **Record Batch Data**
  - *(once per hashing session)* For every field, hash `field_name as utf8` and `nesting_level (zero-based) as u64`, recursively traversing the schema in **depth-first** order
  - For every leaf column:
    - Produce a **combined nullability bitmap** from the nullability of every parent
    - Update the corresponding column's hasher using the above rules
  - *(final step)* Digests of every array are fed into the combined hasher to produce the final digest

| Type (in `Schema.fbs`) | TypeID (as `u16`) | Followed by                                            |
| ---------------------- | :---------------: | ------------------------------------------------------ |
| Null                   | 0                 |                                                        |
| Int                    | 1                 | `unsigned/signed (0/1) as u8`, `bitwidth as u64`       |
| FloatingPoint          | 2                 | `bitwidth as u64`                                      |
| Binary                 | 3                 |                                                        |
| Utf8                   | 4                 |                                                        |
| Bool                   | 5                 |                                                        |
| Decimal                | 6                 | `bitwidth as u64`, `precision as u64`, `scale as u64`  |
| Date                   | 7                 | `bitwidth as u64`, `DateUnitID`                        |
| Time                   | 8                 | `bitwidth as u64`, `TimeUnitID`                        |
| Timestamp              | 9                 | `TimeUnitID`, `timeZone as nullable Utf8`              |
| Interval               | 10                |                                                        |
| List                   | 11                | `items data type`                                      |
| Struct                 | 12                |                                                        |
| Union                  | 13                |                                                        |
| FixedSizeBinary        | 3                 |                                                        |
| FixedSizeList          | 11                | `items data type`                                      |
| Map                    | 16                |                                                        |
| Duration               | 17                |                                                        |
| LargeBinary            | 3                 |                                                        |
| LargeUtf8              | 4                 |                                                        |
| LargeList              | 11                | `items data type`                                      |

Note that some types (`Utf8` and `LargeUtf8`; `Binary`, `FixedSizeBinary`, and `LargeBinary`; `List`, `FixedSizeList`, and `LargeList`) are represented in the hash identically, as the difference between them is purely an encoding concern.
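To make the encoding concrete, here is a minimal sketch of the byte stream these rules imply for a hashing session over a nullable `Utf8` column containing `["abc", null]`, written against `sha3` directly rather than this crate. It is an illustrative reading of the spec above, not necessarily byte-for-byte what the crate emits:

```rust
use sha3::{Digest, Sha3_256};

// Sketch: one hashing session for a nullable Utf8 column
// containing ["abc", null], following the rules above.
let mut hasher = Sha3_256::new();

// (once per session) data type header per the table: Utf8 -> TypeID 4
hasher.update(4u16.to_le_bytes());

// value "abc": length as u64 (little endian), then the raw bytes
hasher.update(3u64.to_le_bytes());
hasher.update(b"abc");

// null value: a single zero byte
hasher.update([0u8]);

println!("{:x}", hasher.finalize());
```

Reading the tables the same way, a `Timestamp(MILLISECOND, timeZone: None)` header would hash as `9u16` (TypeID), then `1u16` (`TimeUnitID` per the table below), then a single zero byte for the null `timeZone`.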
| DateUnit (in `Schema.fbs`) | DateUnitID (as `u16`) |
| -------------------------- | :-------------------: |
| DAY                        | 0                     |
| MILLISECOND                | 1                     |

| TimeUnit (in `Schema.fbs`) | TimeUnitID (as `u16`) |
| -------------------------- | :-------------------: |
| SECOND                     | 0                     |
| MILLISECOND                | 1                     |
| MICROSECOND                | 2                     |
| NANOSECOND                 | 3                     |

## References

- [Arrow memory layout](https://arrow.apache.org/docs/format/Columnar.html#physical-memory-layout)
- [Arrow Flatbuffers schema](https://github.com/apache/arrow/blob/master/format/Schema.fbs)