# histogram-sampler [![Crates.io](https://img.shields.io/crates/v/histogram-sampler.svg)](https://crates.io/crates/histogram-sampler) [![Documentation](https://docs.rs/histogram-sampler/badge.svg)](https://docs.rs/histogram-sampler/) [![Build Status](https://travis-ci.org/jonhoo/histogram-sampler.svg?branch=master)](https://travis-ci.org/jonhoo/histogram-sampler) Sample values from a distribution that is approximated by a histogram. This crate solves a very specific problem, and it may not be what you want. It is probably easiest to explain visually. Imagine a histogram like this: ``` 0 |---------| 100 10 |-------| 80 20 |---| 40 30 |--| 30 40 |-| 20 50 |-| 18 60 || 10 70 || 7 ``` Now imagine that it was created by taking a sequence of values, counting how many there are of each, rounding those counts to the nearest ten, and then plotting a histogram of the results. This crate let's you take data like that, and produce a *generator* that will produce a *new*, infinte sequence of values such that the distribution of values matches that of the sequence that produced the histogram. At this point, you may be wondering: why would I ever want such a thing? Well, it turns out that you can model a bunch of useful things this way. For example, take the distribution of votes and comments on articles on a site like [lobste.rs](https://lobste.rs) (which we actually have [data for](https://lobste.rs/s/cqnzl5/)). Imagine that you want to make a workload generator that [simulates the workload that such a site would see](https://github.com/jonhoo/trawler). In particular, you want to make sure you model some articles being more popular than others, some comments being more popular than others, etc. Think of what that data might look like... Many posts will have 0-10 upvotes, some will have 10-20, fewer will have 20-30, etc. Same for comments. Or upvotes on comments. Or child comments per comment. Or stories submitted by users. You get the idea. And all of these can be represented in the form given above where the sampled values would be story/comment/user ids! Let's see an example based on [some real data](https://github.com/jonhoo/trawler/blob/master/data/votes_per_story.dat): ```rust // original histogram data let stories_per_votecount = vec![ (0, 16724), (10, 16393), (20, 4601), (30, 1707), (40, 680), (50, 281), (60, 128), (70, 60), (80, 35), (90, 16), (100, 4), (110, 4), (120, 10), (130, 1), (140, 2), (160, 1), (210, 1), (250, 1), (290, 1), ]; let mut rng = rand::thread_rng(); let vote_sampler = Sampler::from_bins(stories_per_votecount, 10 /* bin width */); for _ in 0.. { use rand::distributions::Distribution; println!("vote for {}", vote_sampler.sample(&mut rng)); } ``` If you were to run that code (which currently runs forever), you'd see an endless sequence of votes for various fictitious story ids in the range `[0, 40650)` (because that's how many stories there are in the original histogram). If you were to assemble a histogram of vote counts per story after about 370k votes, that histogram would look *almost* exactly like the original input histogram! If you sample it earlier or later, it won't, because there will be fewer/more votes in total, which effectively shifts the data towards lower or higher bins. But, the distribution across story ids will be the same, so you can keep using this to simulate a particular skewed workload continuing perpetually. ## Bias with small bin sizes When your bin size is small (<=~10), the very first bin is actually very small. It only holds values in the range `[0, 5)`, since any larger value would round to 10. Unfortunately, this also means that there's a relatively high likelihood of accidentally (we are picking randomly after all) pushing one value that should be in the first bin into the second bin. All it would take is for us to get that value once or twice more than the expected number of times. And it turns out that that's actually decently likely. The net effect of this is that the first bin (again, if the bin width is small) is likely to have too few values (usually by ~5 percentage points), and the second bucket is likely to have too many values (by the same ~5 percentage points).