| | |
|---|---|
| Crates.io | arrow-digest |
| lib.rs | arrow-digest |
| version | 53.0.0 |
| source | src |
| created_at | 2021-12-09 03:23:52.713836 |
| updated_at | 2024-09-17 01:59:02.511879 |
| description | Stable hashes for Apache Arrow. |
| homepage | |
| repository | https://github.com/sergiimk/arrow-digest |
| max_upload_size | |
| id | 494953 |
| size | 71,444 |
Unofficial Apache Arrow crate that aims to standardize stable hashing of structured data.
Today, structured data formats like Parquet are binary-unstable / non-reproducible: writing the same logical data may result in different files at the binary level, depending on which writer implementation you use, and may vary with each version.

This crate provides a method and implementation for computing stable hashes of structured data (a logical hash) based on the Apache Arrow in-memory format.
Example usage:
```rust
use arrow::array::Int32Array;
use arrow_digest::ArrayDigestV0;
use sha3::Sha3_256;

// Hash a single array
let array = Int32Array::from(vec![1, 2, 3]);
let digest = ArrayDigestV0::<Sha3_256>::digest(&array);
println!("{:x}", digest);

// Alternatively: use `.update(&array)` to hash multiple arrays of the same type
```
```rust
use std::sync::Arc;
use arrow::array::{Int32Array, StringArray};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use arrow_digest::RecordsDigestV0;
use sha3::Sha3_256;

// Hash record batches
let schema = Arc::new(Schema::new(vec![
    Field::new("a", DataType::Int32, false),
    Field::new("b", DataType::Utf8, false),
]));

let record_batch = RecordBatch::try_new(schema, vec![
    Arc::new(Int32Array::from(vec![1, 2, 3])),
    Arc::new(StringArray::from(vec!["a", "b", "c"])),
]).unwrap();

let digest = RecordsDigestV0::<Sha3_256>::digest(&record_batch);
println!("{:x}", digest);

// Alternatively: use `.update(&batch)` to hash multiple batches with the same schema
```
While we're working towards v1, we reserve the right to break hash stability. Create an issue if you're planning to use this crate.
The resulting hashes could conceivably also be used in content-addressable systems like IPFS and the like, but this is a stretch, as this is not a general-purpose hashing algorithm.

The hashing scheme is defined starting from primitives and building up (a byte-level sketch follows the list):
- `Int, FloatingPoint, Decimal, Date, Time, Timestamp` - hashed using their in-memory binary representation
- `Bool` - individual values are hashed as byte-sized values: `1` for `false` and `2` for `true`
- `Binary, LargeBinary, FixedSizeBinary, Utf8, LargeUtf8` - the length (as `u64`) is hashed, followed by the in-memory representation of the value
- `List, LargeList, FixedSizeList` - the length of the list (as `u64`) is hashed, followed by the hash of the sub-array, according to its data type
- Null values - hashed as a `0` (zero) byte
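To make these rules concrete, here is a minimal byte-level sketch of folding a nullable `Int32` column and a `Utf8` value into a hasher using the `sha3` crate. This is illustrative only, not the crate's actual internals, and it assumes a little-endian in-memory representation:

```rust
use sha3::{Digest, Sha3_256};

fn main() {
    let mut hasher = Sha3_256::new();

    // Int32 column [Some(1), None, Some(3)]: primitives contribute their
    // in-memory binary representation; a null contributes a single 0 byte
    // (per the rule above).
    for v in [Some(1i32), None, Some(3i32)] {
        match v {
            None => hasher.update([0u8]),
            Some(x) => hasher.update(x.to_le_bytes()),
        }
    }

    // Utf8 value "abc": length as u64, followed by the raw bytes.
    let s = "abc";
    hasher.update((s.len() as u64).to_le_bytes());
    hasher.update(s.as_bytes());

    println!("{:x}", hasher.finalize());
}
```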
Schemas are hashed by recursively traversing the schema in depth-first order and hashing, for every field: the `field_name` as utf8, the `nesting_level` (zero-based) as `u64`, and the field's data type according to the following table:

| Type (in Schema.fb) | TypeID (as u16) | Followed by |
|---|---|---|
| Null | 0 | |
| Int | 1 | unsigned/signed (0/1) as u8, bitwidth as u64 |
| FloatingPoint | 2 | bitwidth as u64 |
| Binary | 3 | |
| Utf8 | 4 | |
| Bool | 5 | |
| Decimal | 6 | bitwidth as u64, precision as u64, scale as u64 |
| Date | 7 | bitwidth as u64, DateUnitID |
| Time | 8 | bitwidth as u64, TimeUnitID |
| Timestamp | 9 | TimeUnitID, timeZone as nullable Utf8 |
| Interval | 10 | |
| List | 11 | items data type |
| Struct | 12 | |
| Union | 13 | |
| FixedSizeBinary | 3 | |
| FixedSizeList | 11 | items data type |
| Map | 16 | |
| Duration | 17 | |
| LargeBinary | 3 | |
| LargeUtf8 | 4 | |
| LargeList | 11 | items data type |
Note that some types (`Utf8` and `LargeUtf8`; `Binary`, `FixedSizeBinary`, and `LargeBinary`; `List`, `FixedSizeList`, and `LargeList`) are represented identically in the hash, as the difference between them is purely an encoding concern.
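Equal logical contents in different physical encodings should therefore yield equal digests. A quick sanity check, reusing the `ArrayDigestV0` API from the examples above (a sketch, assuming `digest` accepts both array types):

```rust
use arrow::array::{LargeStringArray, StringArray};
use arrow_digest::ArrayDigestV0;
use sha3::Sha3_256;

fn main() {
    // Same logical values; Utf8 uses i32 offsets, LargeUtf8 uses i64 offsets.
    let utf8 = StringArray::from(vec!["a", "b", "c"]);
    let large_utf8 = LargeStringArray::from(vec!["a", "b", "c"]);

    // Per the note above, the digests should match.
    assert_eq!(
        format!("{:x}", ArrayDigestV0::<Sha3_256>::digest(&utf8)),
        format!("{:x}", ArrayDigestV0::<Sha3_256>::digest(&large_utf8)),
    );
}
```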
| DateUnit (in Schema.fb) | DateUnitID (as u16) |
|---|---|
| DAY | 0 |
| MILLISECOND | 1 |

| TimeUnit (in Schema.fb) | TimeUnitID (as u16) |
|---|---|
| SECOND | 0 |
| MILLISECOND | 1 |
| MICROSECOND | 2 |
| NANOSECOND | 3 |
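Putting the tables together, here is a hedged sketch of how a single top-level `Timestamp(Millisecond)` field named `"ts"` with no time zone might be folded into the schema hash. The length-prefixed field name, the exact byte order, and the single `0` byte for the absent time zone are assumptions mirroring the value rules above:

```rust
use sha3::{Digest, Sha3_256};

fn main() {
    let mut hasher = Sha3_256::new();

    // field_name as utf8 (length-prefixed here, mirroring the Utf8 value rule;
    // this prefix is an assumption, the spec above does not spell it out).
    let name = "ts";
    hasher.update((name.len() as u64).to_le_bytes());
    hasher.update(name.as_bytes());

    // nesting_level (zero-based) as u64: a top-level field.
    hasher.update(0u64.to_le_bytes());

    // TypeID as u16: Timestamp = 9 per the type table.
    hasher.update(9u16.to_le_bytes());

    // Followed by: TimeUnitID as u16 (MILLISECOND = 1 per the unit table),
    // then timeZone as nullable Utf8 (absent, so a single 0 byte, assuming
    // nulls hash the same way as null values).
    hasher.update(1u16.to_le_bytes());
    hasher.update([0u8]);

    println!("schema field digest: {:x}", hasher.finalize());
}
```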