sketch-duplicates

Version: 0.1.0
Description: Find duplicate lines probabilistically
Repository: https://github.com/mpdn/sketch-duplicates
Author: Mike Pedersen (mpdn)
Created: 2020-07-05

README

sketch-duplicates

Find duplicate lines probabilistically.

Motivation

Let's say you have a directory of gzipped text files that you want to check for duplicate lines. The usual way to do this might look something like:

zcat *.gz | sort | uniq -d

The problem with this approach is that it can become very slow for large files. sketch-duplicates provides a way to remove most unique lines, leaving mostly duplicate lines in the output. Because sketch-duplicates is probabilistic, it is not guaranteed to remove all unique lines. A final sort | uniq -d is therefore still necessary, but it will be much faster, since most unique lines have already been removed from its input.
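
To illustrate the principle, here is a minimal counting-sketch in Rust. This is a simplified illustration under assumed Bloom-filter-like semantics, not the crate's actual implementation; the names `Sketch`, `add`, and `maybe_duplicate` are invented for this example:

```rust
use std::collections::hash_map::RandomState;
use std::hash::{BuildHasher, Hash, Hasher};

/// Illustrative counting sketch: `add` corresponds to the "build" pass and
/// `maybe_duplicate` to the "filter" pass. A line whose slots all hold a
/// count of at least 2 *may* be a duplicate; a line with any slot below 2
/// is definitely unique and can be dropped.
struct Sketch {
    counts: Vec<u8>,
    hashers: Vec<RandomState>,
}

impl Sketch {
    fn new(size: usize, probes: usize) -> Self {
        Sketch {
            counts: vec![0; size],
            hashers: (0..probes).map(|_| RandomState::new()).collect(),
        }
    }

    /// One slot index per probe, each from an independently seeded hasher.
    fn slots(&self, line: &str) -> Vec<usize> {
        let len = self.counts.len();
        self.hashers
            .iter()
            .map(|h| {
                let mut state = h.build_hasher();
                line.hash(&mut state);
                (state.finish() as usize) % len
            })
            .collect()
    }

    fn add(&mut self, line: &str) {
        for i in self.slots(line) {
            self.counts[i] = self.counts[i].saturating_add(1);
        }
    }

    fn maybe_duplicate(&self, line: &str) -> bool {
        self.slots(line).iter().all(|&i| self.counts[i] >= 2)
    }
}

fn main() {
    let lines = ["a", "b", "a", "c", "b"];
    let mut sketch = Sketch::new(1 << 16, 4);
    for line in lines {
        sketch.add(line); // build pass over all input
    }
    // Duplicated lines always survive the filter pass; a unique line like
    // "c" is filtered out with high probability (but not with certainty).
    assert!(sketch.maybe_duplicate("a"));
    assert!(sketch.maybe_duplicate("b"));
}
```

Because slot collisions can only inflate counts, the filter never discards a true duplicate; it can only occasionally let a unique line through, which the final sort | uniq -d then catches.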

The above example can be rewritten to use a sketch like this:

zcat *.gz | sketch-duplicates build > sketch
zcat *.gz | sketch-duplicates filter sketch | sort | uniq -d

Multiple sketches can be combined using sketch-duplicates combine. This can be used to parallelize the construction of the sketch (here using GNU Parallel):

echo *.gz | parallel 'zcat {} | sketch-duplicates build' | sketch-duplicates combine > sketch
echo *.gz | parallel 'zcat {} | sketch-duplicates filter sketch' | sort | uniq -d

Options

  • -s, --size: Size of the sketch. Increasing this improves filtering accuracy but consumes more memory. This is set to a conservative default of 8MiB and can often be increased depending on the specific use case.
  • -p, --probes: Number of probes to perform in the sketch for each line.
  • -0, --zero-terminated: Use NUL bytes instead of newlines as line delimiters.
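
To get a feel for the size/probes trade-off: if the sketch behaves like a Bloom-style filter (an assumption — this README does not specify the underlying structure) with $m$ slots, $k$ probes, and $n$ distinct input lines, the probability that a unique line survives filtering is roughly

```latex
p \approx \left(1 - e^{-kn/m}\right)^{k}
```

For example, with $m = 2^{26}$ slots, $n = 2^{23}$ distinct lines, and $k = 4$ probes, $kn/m = 0.5$ and $p \approx (1 - e^{-0.5})^4 \approx 2.4\%$, so the final sort | uniq -d sees only a few percent of the unique lines.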

Install

Install Cargo (e.g. using rustup), then run cargo install sketch-duplicates.
