Crates.io | mqf |
lib.rs | mqf |
version | 1.0.0 |
source | src |
created_at | 2019-11-08 21:50:45.323718 |
updated_at | 2019-11-08 21:50:45.323718 |
description | MQF, Mixed Quotient Filter, is a variant of CQF (Counting Quotient Filter) |
homepage | https://github.com/dib-lab/MQF |
repository | https://github.com/dib-lab/MQF |
max_upload_size | |
id | 179489 |
size | 5,851,779 |
Approximate membership query (AMQ) data structures provide approximate representation for data using a smaller amount of memory compared to the real data size. As the name suggests, AMQ answers if a particular element exists or not in a given dataset but with possible false positive errors. AMQ has many examples such as bloom filter, count-min sketch, Quotient Filter, and Counting Quotient Filter(CQF). Here, we are proposing a new AMQ data structure called Mixed Counting Quotient Filter (MQF).
CQF splits the hash-bits of an item into two components: quotient and remaining parts. Quotient Part is used to determine the target slot. The remaining is inserted into the target slot. The insertion algorithm uses a variant of linear probing to resolve collisions. CQF allows counting the number of instances inserted by using slots following the item's reminder as a counter. CQF has relatively complex scheme to encode counters so that it can distinguish counters from item's reminder.
MQF, Mixed Quotient Filter, is a variant of CQF. MQF uses the same probing technique as CQF. MQF has more metadata called fixed size counters and different encoding for the counters. The improvement makes mqf more memory efficient for wider range of zipifan distribution.
When an item is inserted more than one time, MQF first use the fixed size counter to the number of insertions(count). After the fixed size counter is full, MQF store the count in the following slot and fixed-size counter. MQF uses the necessary number of slots to store the count.
In other words, Fixed-size counters is used in counting and marking the slots used for counting. Fixed-size counters for all slots related to the same item store the counter's maximum value except the last one should store value strictly less than the maximum. When the maximum is reached in the last fixed counter, a new slot is added with empty fixed-size counter.
MQF supports counting and removing items. MQF uses variable size counters; therefore, It is memory efficient when count data that follows zipfian distribution where most the items occur once or twice but few items can happen in very high counts..
MQF has lower bits per element than Bloom filter and Count-min sketches (Ref).
MQF has good data locality which makes it efficient when running on secondary storage.
MQF supports add/remove labels for the item.
MQF can be iterated to get the items and counts inserted in the filter.
MQF supports merging function. Two or more filters can be merged into one filter.
MQF can be resized to bigger/ smaller filter.
apt-get install make g++ cmake zlib1g-dev libbz2-dev
make NH=1
make test NH=1
./mqf_test
MQF Initialization requires the estimation of some parameters: number of slots, Key bits, fixed counter size, and labels size.
Fixed-size counter size is estimated from the shape of data distribution. If most of the items are singletons. The fixed-size counter should be limited to 1 bit. However, If a big portion of items is repeated more than one time, a bigger fixed-size counter will hold more counts and thus save slots.
The number of slots is estimated from the number of items expected to be inserted into the filter. Slots are used for inserting the remaining part of the hash of the items and the count. After calculating the required number of slots, multiply the result by 1.05 because MQF can't be filled by more than 95% of its capacity. Then, round the result to the nearest bigger power of two.
Key bits equal to log2(number of slots) + the remaining part bits. the remaining part bits is estimated from the desired accuracy using the formula below.
Label size is straightforward to be estimated. it can be set to zero if labels are not necessary.
void qf_init(QF *qf, uint64_t nslots, uint64_t key_bits, uint64_t label_bits,uint64_t fixed_counter_size, bool mem, const char *path, uint32_t seed);
bool qf_insert(QF *qf, uint64_t key,
uint64_t count,bool lock, bool spin);
uint64_t qf_count_key(const QF *qf, uint64_t key);
bool qf_remove(QF *qf, uint64_t hash, uint64_t count, bool lock, bool spin);
uint64_t qf_add_label(const QF *qf, uint64_t key, uint64_t label, bool lock, bool spin);
uint64_t qf_get_label(const QF *qf, uint64_t key);
uint64_t qf_remove_label(const QF *qf, uint64_t key, bool lock, bool spin);
QF* qf_resize(QF* qf, int newQ, const char * originalFilename=NULL, const char * newFilename=NULL);
void qf_merge(QF *qfa, QF *qfb, QF *qfc);
void qf_multi_merge(QF *qf_arr[], int nqf, QF *qfr);
void qf_invertable_merge(QF *qf_arr[], int nqf, QF *qfr,std::map<uint64_t, std::vector<int> > *inverted_index_ptr);
Qf* qf_arr : input array of filters
int nqf: number of filters
QF* qfr: pointer to the output filter.
map (uint64_t,vector(int) ) inverted_index_ptr: Pointer to the output index.
bool qf_equals(QF *qfa, QF *qfb);
void qf_intersect(QF *qfa, QF *qfb, QF *qfc);
void qf_subtract(QF *qfa, QF *qfb, QF *qfc);
int qf_space(QF *qf);