| Crates.io | wavelet-matrix |
| lib.rs | wavelet-matrix |
| version | 0.4.7 |
| created_at | 2017-10-29 17:04:55.193378+00 |
| updated_at | 2017-12-21 11:56:45.302736+00 |
| description | A wavelet matrix implementation. Supports various near-O(1) queries on large number of symbols or integers. |
| homepage | |
| repository | https://github.com/sekineh/wavelet-matrix-rs |
| max_upload_size | |
| id | 37359 |
| size | 174,230 |
It provides the various analytics on very large sequence of unsigned integers in constant time.
After adding to Cargo.toml, add this line to lib.rs or main.rs.
extern crate wavelet_matrix;
See crate document top for further examples.
Overall Performance
Error evaluation of O(1) SUM
Given an unsigned integer sequence T, it provides the following queries.
.len():
T..lookup(pos):
T[pos].Counting is performed in O(1).
.count(start..end, value):
e which satisfies e == value included in T[start..end].count_prefix(start..end, value, ignore_bit):
e which satisfies e >> ignore_bit == value >> ignore_bit included in T[start..end]192.168.10.0/24. In this case, the ignore_bit will be 8..count_lt(start..end, value):
e which satisfies e < value included in T[start..end].count_gt(start..end, value):
e which satisfies e > value included in T[start..end].count_range(start..end, val_start..val_end):
e which satisfies val_start <= e < val_end included in T[start..end]Searching is performed in O(1) per a next index.
.search(start..end, value):
e which satisfies e == value included in T[start..end].search_prefix(start..end, value, ignore_bit):
e which satisfies e >> ignore_bit == value >> ignore_bit included in T[start..end]Ranking is performed in roughly O(1) with regard to the number of elements n.
.max_k(start..end, val_start..val_end, k):
(value, count) pairs in descending order.val_start..val_end..min_k(start..end, val_start..val_end, k):
(value, count) pairs in ascending order.val_start..val_end..top_k() is also performed in O(1) in best case, but may take O(n) in the worst case where every value occurs only once!
.top_k(start..end, val_start..val_end, k):
(value, count) pairs in most-frequent-one-first order.val_start..val_end.To achieve O(1) performance regardless of the number of unique values, use .top_k_ranges() instead:
.top_k_ranges(start..end, val_start..val_end, k):
(v_start..v_end, count) pairs in most-frequent-one-first order..top_k(), .top_k_ranges() returns the exhaustive range set that covers all of the values.val_start..val_end..median(start..end):
T[start..end]..quantile(start..end, start + (end - start) / 2)..quantile(start..end, k):
T[start..end].These methods use .top_k_ranges() to enumerate the most relevant value ranges.
They are not as accurate as the method used in Experiment 2.
.sum_experiment1(start..end, val_start..val_end, k):
T[start..end] using up to k wavelet tree nodes.val_start..val_end are processed.k = m + 1 where m is the number of values which are unique..mean_experiment1(start..end, val_start..val_end, k):
T[start..end] using up to k wavelet tree nodes.val_start..val_end are processed.k = m + 1 where m is the number of values which are unique..variance_experiment1(start..end, val_start..val_end, k):
T[start..end] using up to k wavelet tree nodes.val_start..val_end are processed.k = m + 1 where m is the number of values which are unique.Improvement over Experiment 1. They use custom node enumerator to minimize the error.
.sum_experiment2(start..end, val_start..val_end, k):Improvement over Experiment 2. They use Range<u64> to tell how accurate the computed value is.
.sum_experiment3(start..end, val_start..val_end, k):.rank(pos, value): counts value included in T[0..pos].
.rank() always returns 0..select(rank, value): return the position of the (rank+1)-th value
.len() instead of None..median() and .quantile()..median() and .quantile(). They are quite fast, only take 3-5 us on 16-bit values..top_k_ranges() which is faster than .top_k() in worst case..sum_experiment1(), .mean_experiment1() and .variance_experiment1()..sum_experiment2()..sum_experiment3().BENCH.md bench report.BENCH_SUM.md bench report..bit_len()..dim()..top_k(), .max_k() and min_k()..top_k(), .max_k() and min_k()..search() and .search_prefix()..count(), .count_prefix(), .count_lt(), .count_gt() and .count_range().&Vec<u64>, instead of Vec<u64>Add Benchmark.
Profiling
Optimize underlying rsdic structure.
Add travis CI.
Add u128 support or arbitrary-length integer support
The fastest implementation on the planet
A Go package for myriad array operations using wavelet trees
excellent slides in Japanese
The original inventor's pdf: