neardup

Crates.ioneardup
lib.rsneardup
version0.1.0
sourcesrc
created_at2024-08-07 11:12:12.947438
updated_at2024-08-07 11:12:12.947438
descriptionA library for near-duplicate matching
homepage
repository
max_upload_size
id1328301
size714,364
speed (speed1313)

documentation

README

Fast Near-duplicate Matching   Latest Version

Fast near-duplicate matching is a method for quickly finding near-duplicate spans in a document by utilizing the Rabin-Karp algorithm. This algorithm can be used to count the near-duplicates of a query in a pre-training corpus, facilitating the analysis of memorization in Large Language Models (LLMs).

This repository contains the implementation of the fast near-duplicate matching algorithm and the benchmark for the algorithm in Rust. The core functionalities is provided as a crate neardup.

Method

Fast Near-duplicate Matching

  • Input: Suffix $s$, document $d$, and $n$ of $n$-gram
  • Output: Whether $d$ has a span near-duplicate to $s$

Pseudo-code in Python

def fast_near_duplicate_matching(s: list[int], d: list[int], n: int, threshold: float) -> bool:
    l_s = len(s)
    l_d = len(d)
    H = set(ngram(s, n))
    for i in range(max(l_d - l_s, 0)):
        if d[i:i+n] in H:
            for j in range(max(i - l_s + n, 0), i):
                t = d[j:j+l_s]
                if Jaccard_W(s, t) >= threshold:
                    return True
    return False

You can use fast hash functions like fxhash or rolling hash.

When the size of $n$ of $n$-gram is small, fxhash is faster than rolling hash. However, when the size of $n$ is large, rolling hash is faster than fxhash because the rolling hash can calculate the hash value of the next $n$-gram in $O(1)$ time.

How to run

You can count the near-duplicates of queries in the document (sample queries (2 queries) and documents (100 documents) are in the sample_data folder) by running the following command:

$ cargo run --release -- --search-dir ./sample_data --query-path ./sample_data/query.jsonl --threshold 0.6 --n 10
[2024-08-07T10:59:40Z INFO  neardup] query_list_all: 2
[2024-08-07T10:59:40Z INFO  neardup] search_path_list len: 1
[2024-08-07T10:59:40Z INFO  neardup] path: "./sample_data/pythia-00000-00999.jsonl.gz" start loading token_ids_list
[2024-08-07T10:59:40Z INFO  neardup] loaded token_ids_list
[2024-08-07T10:59:40Z INFO  neardup] query: 0 count: 1
[2024-08-07T10:59:40Z INFO  neardup] query: 1 count: 1
[2024-08-07T10:59:40Z INFO  neardup] path idx: 0 finished
[2024-08-07T10:59:40Z INFO  neardup] count: [1, 1]

Count near-duplicates in the Pythia dataset

You can download the Pythia dataset from here After downloading the dataset, you can convert the dataset to the format that this program can read by running the following command:

$ python scripts/prepare_pythia.py --output_dir path/to/output/folder --pythia_data_path path/to/merged/folder/document

Then, you can count the near-duplicates of queries in the Pythia dataset by running the following command:

$ cargo run --release -- --search-dir path/to/output/folder --query-path path/to/output/folder/query.jsonl --threshold 0.6 --n 10

Benchmark

You can run the benchmark for the three methods:

  • Fast near-duplicate matching with fxhash
  • Fast near-duplicate matching with rolling hash
  • Naive near-duplicate matching
$ cargo bench

References

Commit count: 0

cargo fmt