remedian

Crates.ioremedian
lib.rsremedian
version0.1.0
sourcesrc
created_at2024-10-10 04:17:34.149383
updated_at2024-10-10 04:17:34.149383
description A Rust implementation of The Remedian
homepage
repositoryhttps://github.com/sixfold-origami/remedian
max_upload_size
id1403439
size131,081
Rose Peck (sixfold-origami)

documentation

README

Remedian

Remedian is a Rust implementation of The Remedian, a robust method to approximate the median of a large dataset, without needing to load the entire thing in memory. This is desirable in cases where the dataset is so large that loading the whole thing simultaneously is intractable.

Basic usage:

// The default block is configured with a reasonable b and k for most applications
let mut remedian = RemedianBlock::default();

// Read data points from our data source, and fold them into the remedian
for data_point in some_data_stream {
		remedian.add_sample_point(data_point);
}

// Get our (approximate) answer
let median = remedian.median();

For more details, check out examples/minimal.rs, examples/full.rs, or examples/custom_data.rs

The Remedian

The Remedian algorithm is quite simple in concept. It stores k arrays of size b, (where k and b are hyperparameters). Here, we see an example with 4 arrays of size 11:

Image: figure 1 from the original paper, showing the four arrays with arrows pointing from one array to the next

Figure 1 from the original paper

As points are read in from the stream, each array is filled in turn:

  1. The first 11 sample points fill the first array
  2. The median is calculated for array 1, and this intermediate median is stored in the first cell of array 2
  3. The next 11 sample points refill the first array (retaining the allocation)
  4. The median is again calculated for array 1, and this new intermediate median is stored in the second cell of array 2
  5. This process continues, progressively filling each array, until the final array is completely full of data
  6. Then, the median of the final array is calculated and returned

In this way, The Remedian can account for b^k sample points, while only using b*k space.

Crate features

This crate has a single feature, which is enabled by default

  • logging: Enables use of the log crate for warnings. If disabled, eprintln! is used instead

Testing

A sample file of 2000 randomly generated numbers can be found in test_data/2000_values.txt. The values are uniformly distributed from 1 to 1000, and are used in tests to ensure accuracy.

Commit count: 18

cargo fmt