cardinality-estimator-safe

Crates.iocardinality-estimator-safe
lib.rscardinality-estimator-safe
version
sourcesrc
created_at2025-05-02 19:48:30.440165+00
updated_at2025-05-14 20:22:53.717522+00
descriptionEstimate the cardinality of distinct elements in a stream or dataset with no unsafe code
homepage
repositoryhttps://github.com/uniphil/cardinality-estimator-safe
max_upload_size
id1658134
Cargo.toml error:TOML parse error at line 21, column 1 | 21 | autolib = false | ^^^^^^^ unknown field `autolib`, expected one of `name`, `version`, `edition`, `authors`, `description`, `readme`, `license`, `repository`, `homepage`, `documentation`, `build`, `resolver`, `links`, `default-run`, `default_dash_run`, `rust-version`, `rust_dash_version`, `rust_version`, `license-file`, `license_dash_file`, `license_file`, `licenseFile`, `license_capital_file`, `forced-target`, `forced_dash_target`, `autobins`, `autotests`, `autoexamples`, `autobenches`, `publish`, `metadata`, `keywords`, `categories`, `exclude`, `include`
size0
phil (uniphil)

documentation

https://docs.rs/cardinality-estimator-safe

README

cardinality-estimator-safe

build docs.rs crates.io License

cardinality-estimator-safe is a fork of Cloudflare's cardinality-estimator, replacing its data representations with boring old owned data and eliminating all uses of unsafe. Its serialization formats are intended to be reasonable with serde_json, and efficient with bincode.

cardinality-estimator-safe estimates the number of distinct elements in a stream or dataset in an efficient manner. This library uses HyperLogLog++ with an optimized low memory footprint and high accuracy approach, suitable for large-scale data analysis tasks.

Overview

cardinality-estimator-safe is efficient in terms of memory usage, latency, and accuracy. This is achieved by leveraging a combination of unique data structure design, efficient algorithms, and HyperLogLog++ for high cardinality ranges.

Compared to other crates

  • cardinality-estimator is currently vulnerable to memory safety violations due to its pervasive use of unsafe that appears to contain incorrect assumptions.
  • amadeus-streaming writes out a huge array when serializing small cardinalities. it contains unsafe code without much documentation to justify it, though that code can be avoided by avoiding the SIMD paths.
  • hyperloglogplus serialization is not compact, writing out even phantom data when serializing.

cardinality-estimator-safe is meant for serialization-heavy use-cases. it serializes to a fairly compact representation with serde_json, and is designed to be especially compact with binary encoders like bincode and postcard. you can see samples with cargo run --features with_serde --example json.

cardinality-estimator-safe is generic over BuildHasher instead of Hasher like cardinality-estimator, so it's possible to add a salt/prefix to the hashes for some resistance against malicious estimate inflation when users control any input data.

Crate status

  • all unsafe code has been eliminated, but there are still many potential panics in the code.
  • fuzzing harnesses are improved from cardinality-estimator, and excersized to try to discover any paths to a panic.
  • serialization and deserialization have been completely rewritten.
  • deserialization performance for hyperloglog representations is currently less than optimal, but there are easy ways to make it fast.
  • some of the code could definitely be refactored for further clarity.

Getting Started

To use cardinality-estimator-safe, add it to your Cargo.toml under [dependencies]:

[dependencies]
cardinality-estimator-safe = "2.1.0"

Then, import cardinality-estimator-safe in your Rust program:

use cardinality_estimator_safe::CardinalityEstimator;

let mut estimator = CardinalityEstimator::<12, 6>::new();
estimator.insert("test");
let estimate = estimator.estimate();

println!("estimate = {}", estimate);

Please refer to our examples and benchmarks in the repository for more complex scenarios.

Low memory footprint

The cardinality-estimator achieves low memory footprint by leveraging an efficient data storage format. The data is stored in three different representations - Small, Array, and HyperLogLog - depending on the cardinality range. For instance, for a cardinality of 0 to 2, only 8 bytes of stack memory and 0 bytes of heap memory are used.

Low latency

The crate offers low latency by using auto-vectorization for slice operations via compiler hints to use SIMD instructions. The number of zero registers and registers' harmonic sum are stored and updated dynamically as more data is inserted, resulting in fast estimate operations.

High accuracy

The cardinality-estimator-safe achieves high accuracy by using precise counting for small cardinality ranges and HyperLogLog++ with LogLog-Beta bias correction for larger ranges. This provides expected error rates as low as 0.02% for large cardinalities.

Benchmarks

Benchmarks are added to directly compare Cloudflare's cardinality-estimator with cardinality-estimator-safe. They are included beside Cloudflare's original benchmarks for context.

To run benchmarks you first need to install cargo-criterion binary:

cargo install cargo-criterion

Then benchmarks with output format JSON to save results for further analysis:

make bench

We've benchmarked cardinality-estimator against several other crates in the ecosystem:

Please note, that hyperloglog and probabilistic-collections crates have bug in calculation of precision p based on provided probability:

  • incorrect formula: p = (1.04 / error_probability).powi(2).ln().ceil() as usize;
  • corrected formula: p = (1.04 / error_probability).powi(2).log2().ceil() as usize;

We're continuously working to make cardinality-estimator the fastest, lightest, and most accurate tool for cardinality estimation in Rust.

Benchmarks presented below are executed on Linux laptop with 13th Gen Intel(R) Core(TM) i7-13800H processor and compiler flags set to RUSTFLAGS=-C target-cpu=native.

Memory usage

Cardinality Estimators Memory Usage

Table below compares memory usage of different cardinality estimators. The number in each cell represents stack memory bytes / heap memory bytes / heap memory blocks at each measured cardinality.

cardinality-estimator-safe has a small memory overhead compared to cardinality-estimator. It uses the second-lowest stack memory, and the second-lowest heap memory for cardinalities up to 128, and a small constant overhead for cardinalities 256 and above.

Note, that hyperloglogplus implementation has particularly high memory usage especially for cardinalities above 256.

cardinality cardinality_estimator cardinality_estimator_safe amadeus_streaming probabilistic_collections hyperloglog hyperloglogplus
0 8 / 0 / 0 40 / 0 / 0 48 / 4096 / 1 128 / 4096 / 1 120 / 4096 / 1 160 / 0 / 0
1 8 / 0 / 0 40 / 0 / 0 48 / 4096 / 1 128 / 4096 / 1 120 / 4096 / 1 160 / 28 / 1
2 8 / 0 / 0 40 / 0 / 0 48 / 4096 / 1 128 / 4096 / 1 120 / 4096 / 1 160 / 28 / 1
4 8 / 16 / 1 40 / 16 / 1 48 / 4096 / 1 128 / 4096 / 1 120 / 4096 / 1 160 / 76 / 2
8 8 / 48 / 2 40 / 64 / 2 48 / 4096 / 1 128 / 4096 / 1 120 / 4096 / 1 160 / 164 / 3
16 8 / 112 / 3 40 / 160 / 3 48 / 4096 / 1 128 / 4096 / 1 120 / 4096 / 1 160 / 332 / 4
32 8 / 240 / 4 40 / 352 / 4 48 / 4096 / 1 128 / 4096 / 1 120 / 4096 / 1 160 / 660 / 5
64 8 / 496 / 5 40 / 736 / 5 48 / 4096 / 1 128 / 4096 / 1 120 / 4096 / 1 160 / 1040 / 11
128 8 / 1008 / 6 40 / 1504 / 6 48 / 4096 / 1 128 / 4096 / 1 120 / 4096 / 1 160 / 2261 / 17
256 8 / 4092 / 7 40 / 4588 / 7 48 / 4096 / 1 128 / 4096 / 1 120 / 4096 / 1 160 / 8081 / 29
512 8 / 4092 / 7 40 / 4588 / 7 48 / 4096 / 1 128 / 4096 / 1 120 / 4096 / 1 160 / 33455 / 53
1024 8 / 4092 / 7 40 / 4588 / 7 48 / 4096 / 1 128 / 4096 / 1 120 / 4096 / 1 160 / 136856 / 101
2048 8 / 4092 / 7 40 / 4588 / 7 48 / 4096 / 1 128 / 4096 / 1 120 / 4096 / 1 160 / 195831 / 120
4096 8 / 4092 / 7 40 / 4588 / 7 48 / 4096 / 1 128 / 4096 / 1 120 / 4096 / 1 160 / 195831 / 120
8192 8 / 4092 / 7 40 / 4588 / 7 48 / 4096 / 1 128 / 4096 / 1 120 / 4096 / 1 160 / 195831 / 120
16384 8 / 4092 / 7 40 / 4588 / 7 48 / 4096 / 1 128 / 4096 / 1 120 / 4096 / 1 160 / 195831 / 120
32768 8 / 4092 / 7 40 / 4588 / 7 48 / 4096 / 1 128 / 4096 / 1 120 / 4096 / 1 160 / 195831 / 120
65536 8 / 4092 / 7 40 / 4588 / 7 48 / 4096 / 1 128 / 4096 / 1 120 / 4096 / 1 160 / 195831 / 120
131072 8 / 4092 / 7 40 / 4588 / 7 48 / 4096 / 1 128 / 4096 / 1 120 / 4096 / 1 160 / 195831 / 120
262144 8 / 4092 / 7 40 / 4588 / 7 48 / 4096 / 1 128 / 4096 / 1 120 / 4096 / 1 160 / 195831 / 120
524288 8 / 4092 / 7 40 / 4588 / 7 48 / 4096 / 1 128 / 4096 / 1 120 / 4096 / 1 160 / 195831 / 120
1048576 8 / 4092 / 7 40 / 4588 / 7 48 / 4096 / 1 128 / 4096 / 1 120 / 4096 / 1 160 / 195831 / 120

Insert performance

Cardinality Estimators Insert Time

Table below represents insert time in nanoseconds per element.

cardinality-estimator-safe approaches the performance of cardinality-estimator across most cardinality ranges, despite containing no unsafe code.

cardinality cardinality-estimator cardinality-estimator-safe amadeus-streaming probabilistic-collections hyperloglog hyperloglogplus
0 0.31 0.37 62.17 73.59 73.47 4.02
1 1.94 2.31 70.93 79.98 79.44 35.72
2 1.91 2.32 37.83 41.36 41.55 23.1
4 6.57 5.56 21.28 24.27 23.44 26.72
8 7.22 6.96 13.35 15.4 14.32 26.44
16 6.36 6.58 9.34 11.07 9.68 26.02
32 5.67 5.68 7.15 8.89 7.44 27.16
64 5.33 5.52 6.14 7.89 6.22 27.57
128 5.82 5.7 5.56 7.31 5.66 29.92
256 5.93 5.87 5.3 6.99 5.43 38.69
512 4.94 5.25 5.15 6.83 5.48 56.15
1024 4.77 4.99 5.12 6.73 5.82 92.54
2048 4.41 5.48 5.04 6.68 6.57 66.57
4096 3.95 6.7 5.05 6.66 7.91 34.45
8192 3.78 8.0 5.01 6.69 9.26 18.69
16384 4.71 7.04 4.98 6.66 8.67 10.55
32768 4.63 5.66 4.97 6.65 7.57 6.92
65536 4.06 4.51 4.97 6.65 6.56 4.81
131072 3.52 3.7 4.97 6.64 5.89 3.5
262144 3.12 3.18 4.97 6.64 5.47 2.74
524288 2.86 2.86 4.95 6.63 5.2 2.3
1048576 2.7 2.67 4.96 6.63 5.05 2.05

Estimate performance

Cardinality Estimators Estimate Time

Table below represents estimate time in nanoseconds per call.

cardinality-estimator-safe is matches the performance of cardinality-estimator, with the lowest estimate time for all cardinalities.

Note, that amadeus-streaming implementation is also quite effective at estimate operation, however it has higher memory usage as indicated by table above. Implementations probabilistic-collections, hyperloglogplus and hyperloglogplus have much higher estimate time, especially for higher cardinalities.

cardinality cardinality-estimator cardinality-estimator-safe amadeus-streaming probabilistic-collections hyperloglog hyperloglogplus
0 0.28 0.28 3.97 4543.38 52.74 10.73
1 0.28 0.28 3.97 4557.16 52.87 26.45
2 0.28 0.28 3.97 4556.51 53.03 30.28
4 0.28 0.29 3.97 4563.68 52.71 38.85
8 0.28 0.28 3.96 4552.99 52.52 57.3
16 0.28 0.28 3.97 4593.68 52.73 84.11
32 0.28 0.28 3.97 4630.83 52.73 139.48
64 0.28 0.28 3.97 4723.76 52.78 250.5
128 0.28 0.28 3.97 4866.25 52.84 472.41
256 0.28 0.28 3.97 5159.1 52.85 921.32
512 0.28 0.28 3.97 5529.77 52.8 2028.33
1024 0.28 0.28 3.97 5691.7 52.75 4278.17
2048 0.28 0.28 3.72 5886.09 52.01 2.73
4096 0.28 0.28 20.79 6842.33 52.75 2188.78
8192 0.28 0.28 20.75 7696.39 52.62 2187.54
16384 0.28 0.28 20.71 7505.85 52.47 2179.01
32768 0.28 0.28 4.96 7078.18 1087.54 2174.34
65536 0.28 0.28 2.23 6894.33 6427.93 2187.89
131072 0.28 0.28 2.23 7454.77 7008.28 2186.92
262144 0.28 0.28 2.23 8374.68 7973.47 2187.51
524288 0.28 0.28 2.23 7755.7 7424.26 2189.29
1048576 0.28 0.28 2.23 7734.45 7431.54 2191.21

Error rate

Cardinality Estimators Error Rate

Table below represents average absolute relative error across 100 runs of estimator on random elements at given cardinality.

cardinality-estimator-safe matches cardinality-estimator, performing on par with amadeus-streaming and hyperloglog estimators, but especially well with low error rate for cardinalities up to 128.

Note, that probabilistic-collections implementation seems to have bug in its estimation operation for cardinalities >=32768.

cardinality cardinality-estimator cardinality-estimator-safe amadeus-streaming probabilistic-collections hyperloglog hyperloglogplus
0 0.0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0025 0.0
8 0.0 0.0 0.0 0.0025 0.0 0.0
16 0.0 0.0 0.0019 0.0013 0.0019 0.0
32 0.0 0.0 0.0041 0.0022 0.0041 0.0
64 0.0 0.0 0.0066 0.0073 0.0072 0.0
128 0.0 0.0 0.0123 0.0135 0.0134 0.0
256 0.008 0.008 0.0097 0.0095 0.0089 0.0
512 0.0088 0.0088 0.01 0.0083 0.0087 0.0
1024 0.008 0.008 0.0094 0.0091 0.0091 0.0
2048 0.0092 0.0092 0.0093 0.0098 0.0096 0.01
4096 0.0099 0.0099 0.0108 0.0113 0.0111 0.0103
8192 0.0096 0.0096 0.0095 0.0127 0.012 0.0109
16384 0.0116 0.0116 0.0107 0.0245 0.0203 0.0117
32768 0.0125 0.0125 0.0109 158000000000000.0 0.0462 0.0116
65536 0.0132 0.0132 0.0133 281000000000000.0 0.0137 0.0118
131072 0.0116 0.0116 0.0121 141000000000000.0 0.0122 0.0127
262144 0.0137 0.0137 0.0144 70400000000000.0 0.0131 0.0116
524288 0.0138 0.0138 0.0136 35200000000000.0 0.0136 0.0121
1.04858e+06 0.0113 0.0113 0.0124 17600000000000.0 0.0136 0.011
Commit count: 0

cargo fmt