nano-hyperloglog

Documentation: https://docs.rs/nano-hyperloglog · Repository: https://github.com/aovestdipaperino/nano-hyperloglog

A high-performance HyperLogLog implementation in Rust for cardinality estimation with pluggable storage backends.


What is HyperLogLog?

HyperLogLog is a probabilistic data structure that estimates the cardinality (number of unique elements) of a dataset. It's incredibly space-efficient: count billions of unique items using just kilobytes of memory, with typical accuracy within 2% of the true count.

Use cases:

  • Unique visitor counting for web analytics
  • Distinct user tracking across distributed systems
  • Database query optimization (cardinality estimation)
  • Network traffic analysis
  • A/B testing metrics

Features

  • 🚀 Fixed memory usage - Count billions of items with ~16KB (configurable)
  • 🎯 High accuracy - Typically within 0.8-2% of true count
  • 🔀 Mergeable - Combine counts from multiple sources effortlessly
  • 💾 Pluggable storage - File-based or Elasticsearch backends
  • 🌐 HTTP server - Optional Redis-compatible REST API (PFADD/PFCOUNT/PFMERGE)
  • ⚡ Zero-copy operations - Efficient serialization/deserialization
  • 🦀 Type-safe - Leverage Rust's type system for compile-time guarantees
  • ✅ Well-tested - Comprehensive test suite with edge cases

Quick Start

Add to your Cargo.toml:

[dependencies]
nano-hyperloglog = "0.1"

Basic Usage

use hyperloglog::HyperLogLog;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Create with precision 14 (16KB memory, ~0.8% error)
    let mut hll = HyperLogLog::new(14)?;

    // Add elements
    for i in 0..100_000 {
        hll.add(&i);
    }

    // Get estimated count
    let count = hll.count();
    println!("Estimated: {} (actual: 100,000)", count);
    // Output: Estimated: 99,723 (actual: 100,000) - 0.28% error!

    Ok(())
}

Merging Counts

Perfect for distributed systems:

use hyperloglog::HyperLogLog;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Three servers tracking visitors
    let mut server1 = HyperLogLog::new(14)?;
    let mut server2 = HyperLogLog::new(14)?;
    let mut server3 = HyperLogLog::new(14)?;

    // Each sees different (potentially overlapping) users
    for i in 0..5000 { server1.add(&format!("user_{}", i)); }
    for i in 2500..7500 { server2.add(&format!("user_{}", i)); }
    for i in 5000..10000 { server3.add(&format!("user_{}", i)); }

    // Merge to get total unique users
    server1.merge(&server2)?;
    server1.merge(&server3)?;

    println!("Total unique users: {}", server1.count());
    // Output: Total unique users: ~10,000

    Ok(())
}
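
Merging works because the union of two sketches with the same precision is just a register-wise maximum, so merging never loses information. A toy sketch of that idea (not this crate's internals):

// Toy register-wise merge: the merged sketch keeps, per register, the larger
// of the two observed ranks. Assumes both sketches use the same precision.
fn merge_registers(a: &mut [u8], b: &[u8]) {
    assert_eq!(a.len(), b.len(), "precisions must match");
    for (ra, rb) in a.iter_mut().zip(b) {
        *ra = (*ra).max(*rb);
    }
}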

Persistent Storage

use hyperloglog::{HyperLogLog, Storage, storage::FileStorage};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let storage = FileStorage::new("./data").await?;

    // Create and populate
    let mut hll = HyperLogLog::new(14)?;
    hll.add_str("user123");
    hll.add_str("user456");

    // Persist to disk
    storage.store("daily_visitors", &hll).await?;

    // Load later
    let loaded = storage.load("daily_visitors").await?;
    println!("Count: {}", loaded.count());

    Ok(())
}

Precision and Memory Tradeoffs

Precision | Memory | Standard Error | Use Case
----------|--------|----------------|------------------------------
10        | 1 KB   | ±3.25%         | Quick estimates, tight memory
12        | 4 KB   | ±1.63%         | Good balance
14        | 16 KB  | ±0.81%         | Default - recommended
16        | 64 KB  | ±0.41%         | High accuracy needed
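
These error figures follow the standard HyperLogLog bound of roughly 1.04/√m, where m = 2^precision is the number of registers. For example, precision 14 gives m = 16,384, so the standard error is about 1.04/128 ≈ 0.81%; each extra precision bit doubles memory and shrinks the error by a factor of √2.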

Feature Flags

Control what gets compiled:

[dependencies]
nano-hyperloglog = { version = "0.1", features = ["elasticsearch-storage", "server"] }

Available features:

  • file-storage (default) - File-based persistence
  • elasticsearch-storage - Elasticsearch backend
  • server - HTTP server with Redis-compatible API
  • full - Everything
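
If persistence isn't needed at all, the usual Cargo switch applies (assuming the core estimator carries no backend dependencies):

[dependencies]
nano-hyperloglog = { version = "0.1", default-features = false }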

HTTP Server

Run a Redis-compatible HTTP service:

cargo run --example server --features server

API Endpoints

# Add elements (PFADD)
curl -X POST http://localhost:3000/pfadd/daily_visitors \
  -H "Content-Type: application/json" \
  -d '{"elements": ["user123", "user456", "user789"]}'

# Get count (PFCOUNT)
curl http://localhost:3000/pfcount/daily_visitors
# {"count": 3}

# Merge multiple HLLs (PFMERGE)
curl -X POST http://localhost:3000/pfmerge/all_visitors \
  -H "Content-Type: application/json" \
  -d '{"source_keys": ["page_home", "page_about"]}'

# Check existence
curl http://localhost:3000/exists/daily_visitors
# true

# List all keys
curl http://localhost:3000/keys
# ["daily_visitors", "all_visitors"]

Configuration

Via environment variables:

# Storage backend
STORAGE_BACKEND=file              # or "elasticsearch"
FILE_STORAGE_PATH=./data          # for file backend
ELASTICSEARCH_URL=http://localhost:9200
ELASTICSEARCH_INDEX=hyperloglog

# Server
BIND_ADDRESS=0.0.0.0:3000

cargo run --example server --features server
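
Or set everything inline for a one-off run:

STORAGE_BACKEND=file FILE_STORAGE_PATH=./data BIND_ADDRESS=0.0.0.0:3000 \
  cargo run --example server --features server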

Examples

Run examples to see it in action:

# Basic counting
cargo run --example basic_usage

# Merging from multiple sources
cargo run --example merging

# File storage
cargo run --example file_storage --features file-storage

# Compare precision values
cargo run --example precision_comparison

# HTTP server
cargo run --example server --features server

How It Works

HyperLogLog uses a clever trick based on probability theory:

  1. Hash each element to get a uniform bit pattern
  2. Count leading zeros in each hash (long runs are unlikely, so observing one suggests many distinct items)
  3. Partition hashes into multiple "registers" to reduce variance
  4. Estimate cardinality from the harmonic mean of the per-register maxima

For technical details, see the algorithm explanation.
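
A toy end-to-end estimator makes the four steps concrete. This is a simplified sketch, not this crate's implementation; it omits the small- and large-range corrections a production HLL applies:

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Steps 1-4 in miniature: hash, split off p index bits, track the maximum
// rank (first 1-bit position) per register, then combine harmonically.
fn toy_estimate<T: Hash>(items: &[T], p: u32) -> f64 {
    let m = 1usize << p;
    let mut registers = vec![0u8; m];
    for item in items {
        let mut hasher = DefaultHasher::new();
        item.hash(&mut hasher);
        let hash = hasher.finish();
        let idx = (hash >> (64 - p)) as usize; // top p bits -> register index
        let rank = (hash << p).leading_zeros() as u8 + 1; // rank of remaining bits
        registers[idx] = registers[idx].max(rank);
    }
    // Bias-corrected harmonic mean of 2^-register.
    let sum: f64 = registers.iter().map(|&r| 2f64.powi(-(r as i32))).sum();
    let alpha = 0.7213 / (1.0 + 1.079 / m as f64);
    alpha * (m as f64) * (m as f64) / sum
}

fn main() {
    let items: Vec<u64> = (0..100_000).collect();
    println!("estimate: {:.0}", toy_estimate(&items, 14)); // ~100,000, within ~0.8%
}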

Performance

Benchmarks on M1 MacBook Pro:

Operation            | Time   | Throughput
---------------------|--------|-------------
Add element          | ~15 ns | 66M ops/sec
Count (precision 14) | ~8 μs  | 125K ops/sec
Merge (16 KB each)   | ~12 μs | 83K ops/sec
Serialize to JSON    | ~50 μs | 20K ops/sec

Memory usage is fixed: 2^precision bytes (16KB for precision 14).
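
To sanity-check the add throughput on your own hardware, here is a rough wall-clock harness built only on the API shown in the examples above (a proper benchmark would use criterion):

use std::time::Instant;
use hyperloglog::HyperLogLog;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut hll = HyperLogLog::new(14)?;
    let n: u64 = 10_000_000;
    let start = Instant::now();
    for i in 0..n {
        hll.add(&i);
    }
    let elapsed = start.elapsed();
    println!(
        "{} adds in {:?} (~{:.0}M ops/sec)",
        n,
        elapsed,
        n as f64 / elapsed.as_secs_f64() / 1e6
    );
    Ok(())
}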

Why Rust?

  • Zero-cost abstractions - Fast as C, safe as high-level languages
  • Fearless concurrency - Share HyperLogLogs across threads safely
  • No GC pauses - Predictable latency for real-time systems
  • Small binaries - Server example compiles to ~8MB

Comparison with Redis

Feature                | nano-hyperloglog | Redis PFCOUNT
-----------------------|------------------|--------------------
Configurable precision | ✅ 4-16 bits     | ❌ Fixed at 14 bits
Persistent storage     | ✅ File/ES       | ✅ RDB/AOF
HTTP API               | ✅ Optional      | ❌
Type-safe API          | ✅ (Rust)        | ❌
Startup time           | < 10 ms          | ~100 ms
Memory footprint       | ~8 MB binary     | ~10 MB process

When to use each:

  • Redis: Already using Redis, need sub-millisecond latency, want battle-tested stability
  • nano-hyperloglog: Want dedicated service, need custom storage, building in Rust, serverless deployments

Contributing

Contributions welcome! Please:

  1. Add tests for new features
  2. Run cargo fmt and cargo clippy
  3. Update documentation

License

Licensed under either of:

  • Apache License, Version 2.0
  • MIT License

at your option.

Acknowledgments

Inspired by Redis's PFCOUNT and the excellent work of Philippe Flajolet on probabilistic counting algorithms.
