| | |
|---|---|
| Crates.io | neardup |
| lib.rs | neardup |
| version | 0.1.0 |
| source | src |
| created_at | 2024-08-07 11:12:12.947438 |
| updated_at | 2024-08-07 11:12:12.947438 |
| description | A library for near-duplicate matching |
| homepage | |
| repository | |
| max_upload_size | |
| id | 1328301 |
| size | 714,364 |
Fast near-duplicate matching is a method for quickly finding near-duplicate spans in a document by utilizing the Rabin-Karp algorithm. This algorithm can be used to count the near-duplicates of a query in a pre-training corpus, facilitating the analysis of memorization in Large Language Models (LLMs).
This repository contains a Rust implementation of the fast near-duplicate matching algorithm together with benchmarks. The core functionality is provided as the neardup crate.
```python
def fast_near_duplicate_matching(s: list[int], d: list[int], n: int, threshold: float) -> bool:
    """Return True if the document d contains a span that is a near-duplicate of the query s."""
    l_s = len(s)
    l_d = len(d)
    # Hash set of all n-grams of the query (ngram yields hashable tuples of length n).
    H = set(ngram(s, n))
    for i in range(l_d - n + 1):
        # Only positions whose n-gram also occurs in the query are inspected further.
        if tuple(d[i:i + n]) in H:
            # Check every window of length l_s that contains this n-gram.
            for j in range(max(i - l_s + n, 0), min(i, l_d - l_s) + 1):
                t = d[j:j + l_s]
                # Jaccard_W: weighted Jaccard similarity between the query and the window.
                if Jaccard_W(s, t) >= threshold:
                    return True
    return False
```
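`Jaccard_W(s, t)` above denotes a weighted (multiset) Jaccard similarity between the query and a candidate window; the crate's exact definition may differ. A minimal Rust sketch of a multiset Jaccard over `u64` token IDs (an illustration, not the neardup API):

```rust
use std::collections::HashMap;

/// Weighted (multiset) Jaccard similarity between two token sequences:
/// sum of per-token minimum counts divided by sum of per-token maximum counts.
fn weighted_jaccard(s: &[u64], t: &[u64]) -> f64 {
    // Per-token occurrence counts in s (first field) and t (second field).
    let mut counts: HashMap<u64, (usize, usize)> = HashMap::new();
    for &tok in s {
        counts.entry(tok).or_insert((0, 0)).0 += 1;
    }
    for &tok in t {
        counts.entry(tok).or_insert((0, 0)).1 += 1;
    }
    let (mut numer, mut denom) = (0usize, 0usize);
    for &(cs, ct) in counts.values() {
        numer += cs.min(ct);
        denom += cs.max(ct);
    }
    if denom == 0 { 0.0 } else { numer as f64 / denom as f64 }
}
```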
You can use a fast hash function such as fxhash or a rolling hash. When the $n$-gram size $n$ is small, fxhash is faster than the rolling hash; when $n$ is large, the rolling hash is faster because it computes the hash of the next $n$-gram in $O(1)$ time.
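To illustrate the $O(1)$ update that makes the rolling hash attractive for large $n$, here is a minimal Rabin-Karp sketch over `u64` token IDs. The base and modulus are arbitrary example values, and this is not the crate's internal implementation:

```rust
const BASE: u128 = 1_000_003;
const MODULUS: u128 = (1u128 << 61) - 1;

/// Rolling (Rabin-Karp) hashes of every n-gram in `tokens`.
fn rolling_ngram_hashes(tokens: &[u64], n: usize) -> Vec<u64> {
    if n == 0 || tokens.len() < n {
        return Vec::new();
    }
    // BASE^(n-1) mod MODULUS, needed to drop the outgoing token in O(1).
    let mut pow = 1u128;
    for _ in 0..n - 1 {
        pow = pow * BASE % MODULUS;
    }
    // Hash of the first n-gram, computed in O(n).
    let mut h = 0u128;
    for &tok in &tokens[..n] {
        h = (h * BASE + tok as u128) % MODULUS;
    }
    let mut hashes = vec![h as u64];
    // Each following n-gram hash is derived from the previous one in O(1):
    // remove the leftmost token, then append the new rightmost token.
    for i in n..tokens.len() {
        h = (h + MODULUS - tokens[i - n] as u128 * pow % MODULUS) % MODULUS;
        h = (h * BASE + tokens[i] as u128) % MODULUS;
        hashes.push(h as u64);
    }
    hashes
}
```

With fxhash, each n-gram is instead hashed independently (e.g. hashing every `tokens.windows(n)` slice with `fxhash::hash64`), which costs $O(n)$ per n-gram but has a very small constant factor, which is why it wins for small $n$.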
You can count the near-duplicates of the queries in the documents (two sample queries and 100 sample documents are provided in the sample_data folder) by running the following command:
$ cargo run --release -- --search-dir ./sample_data --query-path ./sample_data/query.jsonl --threshold 0.6 --n 10
[2024-08-07T10:59:40Z INFO neardup] query_list_all: 2
[2024-08-07T10:59:40Z INFO neardup] search_path_list len: 1
[2024-08-07T10:59:40Z INFO neardup] path: "./sample_data/pythia-00000-00999.jsonl.gz" start loading token_ids_list
[2024-08-07T10:59:40Z INFO neardup] loaded token_ids_list
[2024-08-07T10:59:40Z INFO neardup] query: 0 count: 1
[2024-08-07T10:59:40Z INFO neardup] query: 1 count: 1
[2024-08-07T10:59:40Z INFO neardup] path idx: 0 finished
[2024-08-07T10:59:40Z INFO neardup] count: [1, 1]
You can download the Pythia dataset from here. After downloading the dataset, convert it to the format that this program can read by running the following command:
$ python scripts/prepare_pythia.py --output_dir path/to/output/folder --pythia_data_path path/to/merged/folder/document
Then, you can count the near-duplicates of queries in the Pythia dataset by running the following command:
$ cargo run --release -- --search-dir path/to/output/folder --query-path path/to/output/folder/query.jsonl --threshold 0.6 --n 10
You can run the benchmark for the three methods:
$ cargo bench