| Crates.io | digit-bin-index |
| lib.rs | digit-bin-index |
| version | 0.4.2 |
| created_at | 2025-09-05 21:15:35.441773+00 |
| updated_at | 2025-09-18 21:29:27.7065+00 |
| description | A high-performance, O(P) data structure for weighted random sampling of binned probabilities, ideal for large-scale simulations. |
| homepage | |
| repository | https://github.com/Roenbaeck/digit-bin-index |
| max_upload_size | |
| id | 1826224 |
| size | 140,084 |
A DigitBinIndex is a tree-based data structure designed for efficient weighted random selection and removal from large collections of items. It is optimized for scenarios involving millions of items where probabilities are approximate and high performance is critical, such as simulations for Wallenius' noncentral hypergeometric distribution or Fisher's noncentral hypergeometric distribution.
This library provides high-performance solutions for both major types of noncentral hypergeometric distributions:
select_and_remove, where items are selected and removed one at a time.select_many_and_remove, where a batch of unique items is selected and removed together.In simulations, forecasts, or statistical models (e.g., mortality models, Monte Carlo simulations, or machine learning sampling), managing a large, dynamic set of probabilities is common. A key task is to randomly select items based on their weights, often removing them afterward, and repeat this process efficiently. For datasets with millions of items, achieving high performance while maintaining reasonable accuracy is a significant challenge, especially for complex distributions like Wallenius' or Fisher's.
DigitBinIndex is a radix tree that organizes items into bins based on the decimal digits of their rescaled probabilities, enabling fast weighted random selection and updates.
Digit-based Tree Structure: Each level of the tree corresponds to a decimal place of the rescaled weight. For example, a weight of 0.543 at precision 3 is rescaled to 543 and placed by traversing the path: root -> child[5] -> child[4] -> child[3].
Adaptive Bin Storage: Leaf nodes act as bins, storing item IDs in either a fast Vec<u32> (for small bins) or a compressed Roaring Bitmap (for large bins). The bin type is chosen automatically for optimal performance and memory use if a capacity hint is given.
Accumulated Value Index: Each node tracks the accumulated_value (sum of weights beneath it), supporting O(P) weighted random selection, where P is the configured precision (number of decimal places).
select_and_remove) and Fisher's (select_many_and_remove) distributions.pyo3 for cross-language support.DigitBinIndex trades a small, controllable amount of precision by binning probabilities to achieve significant performance gains. The following benchmarks compare DigitBinIndex against an optimized FenwickTree implementation in a realistic, high-churn simulation.
The benchmark scenario starts with a large population (1M or 10M items), then simulates a high volume of activity:
This measures the real-world throughput of both data structures under dynamic conditions. The tests were run on a standard desktop (Intel i7, 16GB RAM, Rust 1.75).
This benchmark simulates sequential selection by removing 100,000 items one-by-one, then adding 110,000 new items. This is a common pattern in agent-based models or iterative simulations. DigitBinIndex's O(P) complexity gives it a decisive advantage as the population scales.
| Scenario (N items) | DigitBinIndex Time |
FenwickTree Time |
Speedup Factor |
|---|---|---|---|
| 1 Million Items (p=3) | ~27.5 ms | ~82.2 ms | ~3.0x faster |
| 1 Million Items (p=5) | ~39.7 ms | ~77.7 ms | ~2.0x faster |
| 10 Million Items (p=3) | ~551.0 ms | ~1723.1 ms | ~3.1x faster |
DigitBinIndex is over 3.0 times faster than the FenwickTree for sequential operations on large datasets. Its performance is dependent on precision (P) and not the number of items (N), allowing it to scale far more effectively.This benchmark simulates simultaneous selection by removing a large batch of 100,000 unique items at once, followed by the acquisition of 110,000 new items. DigitBinIndex uses a highly optimized batch rejection sampling method.
| Scenario (N items) | DigitBinIndex Time |
FenwickTree Time |
Speedup Factor |
|---|---|---|---|
| 1 Million Items (p=3) | ~19.5 ms | ~104.2 ms | ~5.3x faster |
| 1 Million Items (p=5) | ~39.9 ms | ~106.6 ms | ~2.7x faster |
| 10 Million Items (p=3) | ~389.0 ms | ~1649.9 ms | ~4.2x faster |
Key Takeaway: For batch selections, DigitBinIndex is even more efficient, performing up to 5.3 times faster. The batched nature of the operation further highlights the architectural advantages of the radix tree approach for this use case.
Use DigitBinIndex when:
Consider a Fenwick Tree if you require exact precision and your weights differ only at high decimal places (e.g., 0.12345 vs. 0.12346), though this comes at the cost of O(log N) complexity and higher memory usage for large datasets.
The precision parameter controls the radix tree's depth, balancing accuracy, performance, and memory. Higher precision improves sampling accuracy but increases memory usage (up to 10x per additional level) and slightly impacts runtime.
A precision of 3 or 4 (default: 3) is recommended for most applications. This captures sufficient detail for typical weight distributions while maintaining excellent performance and low memory usage.
Each decimal place contributes exponentially less to a weight’s value. For a weight of 0.12345:
1): 0.12): 0.023): 0.0034): 0.0004Truncating at 3 digits limits the error per item to <0.001. In large populations, these errors average out, minimally affecting the selection distribution.
| Precision | Typical Use Case | Trade-offs |
|---|---|---|
| 1-2 | Maximum performance, minimal memory usage. | Best for coarse weights (e.g., 0.1, 0.5). Loses accuracy with fine-grained data. |
| 3-4 | Recommended Default. Optimal for most scenarios. | Captures sufficient detail for simulation or model data. Negligible performance/memory cost. |
| 5+ | High-fidelity scenarios with very close weights. | Distinguishes weights like 0.12345 vs. 0.12346. Increases memory (up to 10x per level) and slightly impacts performance. |
The DigitBinIndex is designed to handle a vast range of use cases, from a few thousand items to trillions, by automatically selecting the most appropriate internal storage engine.
The index accepts u64 for individual item IDs. However, the internal storage of these IDs depends on the backend chosen.
To provide the best balance of performance and memory usage, the library's DigitBinIndex is an enum that automatically switches between three different backends (Small, Medium, and Large) when you use the with_precision_and_capacity() constructor or the explicit constructors small(), medium(), and large().
The selection is based on a simple heuristic: the average number of items expected per bin, which is calculated as capacity / 10^precision.
Small (Vec<u32>):
small(precision: u8).u32 (max 4 billion).select_and_remove speed is the absolute priority (O(1) swap_remove).u64 IDs. remove_many can be O(N) per bin.Medium (RoaringBitmap):
medium(precision: u8).u32 (max 4 billion).u32. Provides excellent memory compression and fast set operations (including remove_many).u64 IDs.Large (RoaringTreemap):
large(precision: u8).u64 (max 18 quintillion = 18 billion billions).u64 support is required.Here are some practical examples of how calling with_precision_and_capacity translates into a specific internal engine.
Small (Vec<u32>) is ChosenYou are simulating a population of 100,000 individuals with u32 IDs.
// Expecting 100,000 items with 3-digit precision
let index = DigitBinIndex::with_precision_and_capacity(3, 100_000);
10^3 = 1,000. The average items per bin is 100,000 / 1,000 = 100.Small variant is chosen. This provides the fastest O(1) select_and_remove performance. (This would truncate u64 IDs).Medium (RoaringBitmap) is ChosenYou need to index 50 million product IDs, all of which fit within u32.
// Expecting 50 million items with 3-digit precision
let index = DigitBinIndex::with_precision_and_capacity(3, 50_000_000);
50,000,000 / 1,000 = 50,000.Medium variant is selected, using RoaringBitmap. This will be highly memory-efficient and very fast for all operations, including remove_many. (This would also truncate u64 IDs).Large (RoaringTreemap) is ChosenYou are working with a massive dataset where item IDs are 64-bit, and you expect trillions of entries.
// Expecting 5 trillion items with 3-digit precision
let index = DigitBinIndex::with_precision_and_capacity(3, 5_000_000_000_000);
5_000_000_000_000 / 1,000 = 5,000,000,000.Large variant is chosen. The heuristic correctly identifies this as a u64-scale problem and selects the only backend, RoaringTreemap, that provides full 64-bit ID support.DigitBinIndex is available as a Python library on PyPI or as a Rust crate on Crates.io. Ensure Python 3.6+ for Python bindings or Rust 1.75+ for the Rust crate.
Install from PyPI:
pip install digit-bin-index
Example usage:
from digit_bin_index import DigitBinIndex
def main():
# Create an index with precision 3 (default).
index = DigitBinIndex()
# With custom precision
index_5 = DigitBinIndex.with_precision(5)
# With custom precision and capacity hint for large datasets
# This might choose a more memory-efficient internal storage.
index_3_xl = DigitBinIndex.with_precision_and_capacity(3, 10_000_000)
# Add items with IDs and weights.
index.add(id=101, weight=0.123) # Low weight
index.add(id=202, weight=0.800) # High weight
index.add(id=303, weight=0.755) # High weight
index.add(id=404, weight=0.110) # Low weight
# Sequential (Wallenius') Draw: Select and remove one item.
# Higher-weighted items (202, 303) are more likely.
selected_item = index.select_and_remove()
if selected_item:
# The returned weight is a float, representing the bin's average weight
item_id, weight = selected_item
print(f"Wallenius draw: ID {item_id}, Weight ~{weight:.3f}")
print(f"Items remaining: {index.count()}") # 3
# Simultaneous (Fisher's) Draw: Select and remove 2 unique items.
selected_items = index.select_many_and_remove(2)
if selected_items:
print(f"Fisher's draw: {selected_items}")
print(f"Items remaining: {index.count()}") # 1
if __name__ == "__main__":
main()
Add to your Cargo.toml:
[dependencies]
digit-bin-index = "0.4.2" # Replace with the latest version from crates.io
Example usage:
use digit_bin_index::DigitBinIndex;
fn main() {
// Create an index with precision 3.
let mut index = DigitBinIndex::with_precision(3);
// Add items with IDs and f64 weights.
index.add(101, 0.123); // Low weight
index.add(202, 0.800); // High weight
index.add(303, 0.755); // High weight
index.add(404, 0.110); // Low weight
// Sequential (Wallenius') Draw: Select and remove one item.
if let Some((id, weight)) = index.select_and_remove() {
println!("Wallenius draw: ID {}, Weight ~{}", id, weight);
}
println!("Items remaining: {}", index.count()); // 3
// Simultaneous (Fisher's) Draw: Select and remove 2 unique items.
if let Some(items) = index.select_many_and_remove(2) {
println!("Fisher's draw: {:?}", items);
}
println!("Items remaining: {}", index.count()); // 1
}
This project is licensed under the MIT License, a permissive open-source license allowing free use, modification, and distribution.