Crates.io | sketch-duplicates |
lib.rs | sketch-duplicates |
version | 0.1.0 |
source | src |
created_at | 2020-07-05 17:42:13.375924 |
updated_at | 2020-07-05 17:42:13.375924 |
description | Find duplicate lines probabilistically |
homepage | |
repository | https://github.com/mpdn/sketch-duplicates |
max_upload_size | |
id | 261689 |
size | 39,062 |
Find duplicate lines probabilistically.
Let's say you have a directory of gzipped text files that you want to check for duplicate lines. The usual way to do this might look something like this:
zcat *.gz | sort | uniq -d
The problem with this is that it can become very slow for large files. sketch-duplicates provides a way to remove most unique lines, leaving mostly duplicate lines in the output. Because sketch-duplicates is probabilistic, it is not guaranteed to remove all unique lines, so a final sort | uniq -d is still necessary; it will, however, be much faster, because most unique lines have already been removed from its input.
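To make the technique concrete, here is a minimal toy counting sketch in Rust. This is only an illustration of the general idea, not the crate's actual implementation; the type name, hashing scheme, and probe derivation are all invented for this example. Note that build and filter are separate passes, so every occurrence of a duplicated line survives, which is exactly what the final sort | uniq -d needs.

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Toy counting sketch (illustrative only): `probes` counters are bumped
// for every line added; a line is a *possible* duplicate if the minimum
// of its counters is at least 2.
struct Sketch {
    counters: Vec<u8>,
    probes: u64,
}

impl Sketch {
    fn new(size: usize, probes: u64) -> Self {
        Sketch { counters: vec![0; size], probes }
    }

    // Derive `probes` counter positions from two hashes (double hashing).
    fn positions(&self, line: &str) -> impl Iterator<Item = usize> {
        let mut h1 = DefaultHasher::new();
        line.hash(&mut h1);
        let start = h1.finish();
        let mut h2 = DefaultHasher::new();
        (line, 0x9e37_79b9u32).hash(&mut h2);
        let step = h2.finish() | 1; // odd step
        let len = self.counters.len() as u64;
        (0..self.probes).map(move |i| (start.wrapping_add(i.wrapping_mul(step)) % len) as usize)
    }

    fn add(&mut self, line: &str) {
        // Collect first so we can mutate `counters` afterwards.
        let idxs: Vec<usize> = self.positions(line).collect();
        for i in idxs {
            self.counters[i] = self.counters[i].saturating_add(1);
        }
    }

    fn may_be_duplicate(&self, line: &str) -> bool {
        // The minimum over the probed counters over-approximates the true
        // count, so genuine duplicates always pass; some unique lines may too.
        self.positions(line).map(|i| self.counters[i]).min().unwrap_or(0) >= 2
    }
}

fn main() {
    let lines = ["a", "b", "a", "c", "b"];
    let mut sketch = Sketch::new(1 << 20, 4);
    for l in &lines {
        sketch.add(l); // build pass
    }
    for l in &lines {
        if sketch.may_be_duplicate(l) {
            println!("{l}"); // filter pass: keeps both "a"s and both "b"s
        }
    }
}

Because the minimum over the probed counters can only over-count, genuine duplicates are never dropped; the price of a small sketch is only that more unique lines slip through to the final sort.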
The original zcat pipeline can be rewritten to use a sketch like this:
zcat *.gz | sketch-duplicates build > sketch
zcat *.gz | sketch-duplicates filter sketch | sort | uniq -d
Multiple sketches can be combined using sketch-duplicates combine. This can be used to parallelize the construction of the sketch (here using GNU Parallel):
printf '%s\n' *.gz | parallel 'zcat {} | sketch-duplicates build' | sketch-duplicates combine > sketch
printf '%s\n' *.gz | parallel 'zcat {} | sketch-duplicates filter sketch' | sort | uniq -d
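combine is cheap because sketches of this kind merge without revisiting the input. Under the same toy counting-sketch model as above (an assumption, not the crate's documented format), merging is just element-wise saturating addition of the counter arrays, provided both sketches share the same size and probe count:

// Illustrative merge for the toy sketch above (assumes identical size and
// probe count): element-wise saturating addition yields a sketch equivalent
// to one built over the concatenation of both inputs.
fn combine(a: &[u8], b: &[u8]) -> Vec<u8> {
    assert_eq!(a.len(), b.len(), "sketches must be the same size");
    a.iter().zip(b).map(|(x, y)| x.saturating_add(*y)).collect()
}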
Options:

-s, --size: Size of the sketch. Increasing this improves filtering accuracy but consumes more memory. This is set to a conservative default of 8 MiB and can often be increased depending on the specific use case (see the rough survival-rate estimate at the end of this page).
-p, --probes: Number of probes to perform in the sketch for each line.
-0, --zero-terminated: Use NUL bytes as line delimiters.

Install Cargo (e.g. using rustup), then run:
cargo install sketch-duplicates
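As a rough guide for tuning --size and --probes: if the sketch behaves like a classic Bloom filter over n input lines with m one-byte counters and k probes (an assumed model; the crate does not spell out its exact data structure), the fraction of unique lines that survive filtering is approximately (1 - e^(-kn/m))^k:

// Rough survival rate for unique lines under the classic Bloom-filter
// approximation (an assumed model): n = lines, m = counters, k = probes.
fn unique_survival_rate(n: f64, m: f64, k: f64) -> f64 {
    (1.0 - (-k * n / m).exp()).powf(k)
}

fn main() {
    // One million lines against the 8 MiB default (one byte per counter
    // assumed) with 4 probes: prints roughly 0.02, i.e. about 2% of
    // unique lines slip through to the final sort | uniq -d.
    println!("{:.3}", unique_survival_rate(1e6, 8.0 * 1024.0 * 1024.0, 4.0));
}

Under this model, the 8 MiB default keeps the survival rate around 2% for a million distinct lines; well beyond that, increasing --size pays off quickly.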