Crates.io | hypertwobits |
lib.rs | hypertwobits |
version | 0.2.2 |
source | src |
created_at | 2024-08-09 11:52:33.764571 |
updated_at | 2024-08-19 10:19:31.123384 |
description | HyperTwoBits cardinality estimation algorithm |
homepage | |
repository | https://github.com/axiomhq/hypertwobits |
max_upload_size | |
id | 1330762 |
size | 104,872 |
HyperTwoBits
is a probabilistic data structure
that estimates the number of distinct elements in a set. It has the same use case as HyperLogLog
,
but it uses less memory and is faster while achieving roughly similar accuracy. In numbers while
HyperLogLog
uses six bits per substream HyperTwoBits
uses, as the name suggests, two bits
To illustrate Tabe 4 from the linked paper page 14:
This implementation improves on the proposed algorithm in a few ways.
ahash
for hashing, but you can use any hasher that implements std::hash::AHasherBuilder
.use hypertwobits::{HyperTwoBits, M512};
let mut htb = HyperTwoBits::<M512>::default();
htb.insert(&"foo");
htb.insert(&"bar");
htb.count();
By default, HyperTwoBits
uses ahash
as the hasher. This hasher is not seeded with a random
value by default. This means that the same input will always produce the same output.
Unlike HashMap's the attack surface this opens up is minimal. The only way to exploit this is to craft an input that would deliberately return imprecise results. This is a non-issue for most use cases.
However, we provide the AHasherBuilder
that creates a random state that is used to seed the hasher.
The tradeoff is that this will be slower, use more memory and means sketches can no longer be merged.
Based on benchmarks it is about 2 times faster than HyperLogLogPlus
and 1.5x faster than HyperBitBit
or respectively 4x and 2.5x faster when used with non-seeded hashers.
The accuracy of HyperTwoBits
is roughyly similar to HyperLogLog
but significantly inferior to HyperLogLogPlus
. A few measurements based on example inputs: