# Fast Near-duplicate Matching   [![Latest Version]][crates.io] [Latest Version]: https://img.shields.io/crates/v/neardup.svg [crates.io]: https://crates.io/crates/neardup Fast near-duplicate matching is a method for quickly finding near-duplicate spans in a document by utilizing the [Rabin-Karp algorithm](https://en.wikipedia.org/wiki/Rabin%E2%80%93Karp_algorithm). This algorithm can be used to count the near-duplicates of a query in a pre-training corpus, facilitating the analysis of memorization in Large Language Models (LLMs). This repository contains the implementation of the fast near-duplicate matching algorithm and the benchmark for the algorithm in Rust. The core functionalities is provided as a crate [```neardup```](). ## Method ### Fast Near-duplicate Matching - Input: Suffix $s$, document $d$, and $n$ of $n$-gram - Output: Whether $d$ has a span near-duplicate to $s$ #### Pseudo-code in Python ```python def fast_near_duplicate_matching(s: list[int], d: list[int], n: int, threshold: float) -> bool: l_s = len(s) l_d = len(d) H = set(ngram(s, n)) for i in range(max(l_d - l_s, 0)): if d[i:i+n] in H: for j in range(max(i - l_s + n, 0), i): t = d[j:j+l_s] if Jaccard_W(s, t) >= threshold: return True return False ``` You can use fast hash functions like [fxhash](https://docs.rs/fxhash/latest/fxhash/) or [rolling hash](https://en.wikipedia.org/wiki/Rolling_hash). When the size of $n$ of $n$-gram is small, fxhash is faster than rolling hash. However, when the size of $n$ is large, rolling hash is faster than fxhash because the rolling hash can calculate the hash value of the next $n$-gram in $O(1)$ time. ## How to run You can count the near-duplicates of queries in the document (sample queries (2 queries) and documents (100 documents) are in the `sample_data` folder) by running the following command: ```bash $ cargo run --release -- --search-dir ./sample_data --query-path ./sample_data/query.jsonl --threshold 0.6 --n 10 [2024-08-07T10:59:40Z INFO neardup] query_list_all: 2 [2024-08-07T10:59:40Z INFO neardup] search_path_list len: 1 [2024-08-07T10:59:40Z INFO neardup] path: "./sample_data/pythia-00000-00999.jsonl.gz" start loading token_ids_list [2024-08-07T10:59:40Z INFO neardup] loaded token_ids_list [2024-08-07T10:59:40Z INFO neardup] query: 0 count: 1 [2024-08-07T10:59:40Z INFO neardup] query: 1 count: 1 [2024-08-07T10:59:40Z INFO neardup] path idx: 0 finished [2024-08-07T10:59:40Z INFO neardup] count: [1, 1] ``` ### Count near-duplicates in the Pythia dataset You can download the Pythia dataset from [here](https://github.com/EleutherAI/pythia?tab=readme-ov-file#exploring-the-dataset) After downloading the dataset, you can convert the dataset to the format that this program can read by running the following command: ```bash $ python scripts/prepare_pythia.py --output_dir path/to/output/folder --pythia_data_path path/to/merged/folder/document ``` Then, you can count the near-duplicates of queries in the Pythia dataset by running the following command: ```bash $ cargo run --release -- --search-dir path/to/output/folder --query-path path/to/output/folder/query.jsonl --threshold 0.6 --n 10 ``` ## Benchmark You can run the benchmark for the three methods: - Fast near-duplicate matching with **fxhash** - Fast near-duplicate matching with **rolling hash** - **Naive** near-duplicate matching ```bash $ cargo bench ``` ## References - [Rabin-Karp algorithm](https://en.wikipedia.org/wiki/Rabin%E2%80%93Karp_algorithm) - [fxhash](https://docs.rs/fxhash/latest/fxhash/) - [Pythia](https://github.com/EleutherAI/pythia)