Crates.io | gaoya |
lib.rs | gaoya |
version | 0.2.0 |
source | src |
created_at | 2022-02-08 03:02:35.526356 |
updated_at | 2023-06-23 06:44:03.385782 |
description | Locality Sensitive Hashing Data Structures |
homepage | https://github.com/serega/gaoya |
repository | https://github.com/serega/gaoya |
max_upload_size | |
id | 528836 |
size | 127,441 |
This project implements Locality Sensitive Hashing algorithms and data structures for indexing and querying text documents. The primary use cases for Gaoya are deduplication and clustering.
64,32,16,8 bit minhash
64,128 bit simhash
Fast implementation in Rust
Multi-threaded thanks to rayon
Python bindings
>>> import gaoya
>>> index = gaoya.minhash.MinHashStringIndex(hash_size=32,
jaccard_threshold=0.5,
num_bands=42,
band_size=3,
num_hashes=42*3,
analyzer='word',
lowercase=True,
ngram_range=(1,1))
>>> corpus = [
... 'This is the first document.',
... 'This document is the second document.',
... 'And this is the third document.',
... 'Is this the first document?',
... 'This not the first nor the second nor the third, but the fourth document'
... ]
>>>
>>> for i, doc in enumerate(corpus): index.insert_document(i, doc)
...
>>> index.query('This is the first document.')
[0, 1, 2, 3]
>>>
$ pip3 install gaoya
Document Deduplication with Gaoya
use gaoya::minhash::{MinHashIndex, MinHasher32, MinHasher} ;
use gaoya::text::whitespace_split;
use fxhash::FxHashSet;
let corpus = [
"This is the first document.",
"This document is the second document.",
"And this is the third document.",
"Is this the first document?",
"This not the first nor the second nor the third, but the fourth document"];
let (num_bands, band_width) = (42, 3);
let minhasher = MinHasher32::new(num_bands * band_width);
let mut index = MinHashIndex::new(num_bands, band_width, 0.5);
for (i, doc) in corpus.iter().enumerate() {
index.insert(i, minhasher.create_signature(whitespace_split(&doc.to_lowercase())));
}
for (i, doc) in corpus.iter().enumerate() {
if i < 4 {
let mut expected = FxHashSet::default();
expected.extend(vec![0, 1, 2, 3].into_iter());
let signature = minhasher.create_signature(whitespace_split(&doc.to_lowercase()));
assert_eq!(index.query_owned(&signature), expected);
} else {
let mut expected = FxHashSet::default();
expected.insert(4);
let signature = minhasher.create_signature(whitespace_split(&doc.to_lowercase()));
assert_eq!(index.query_owned(&signature), expected);
}
}
[1] Chapter 3, Mining of Massive Datasets
[2] Similarity Estimation Techniques from Rounding Algorithms