Crates.io | sketch-duplicates |
lib.rs | sketch-duplicates |
version | 0.1.0 |
source | src |
created_at | 2020-07-05 17:42:13.375924 |
updated_at | 2020-07-05 17:42:13.375924 |
description | Find duplicate lines probabilistically |
homepage | |
repository | https://github.com/mpdn/sketch-duplicates |
max_upload_size | |
id | 261689 |
size | 39,062 |
Find duplicate lines probabilistically.
Let's say you have a directory of gzipped text files that you want to check for duplicate lines. The usual way to do this might look something like this:
zcat *.gz | sort | uniq -d
The problem with this is that it can become very slow for large files. sketch-duplicates provides a way to remove most unique lines, leaving mostly duplicate lines in the output. Because sketch-duplicates is probabilistic, it is not guaranteed to remove all unique lines, so a final sort | uniq -d is still necessary; it will, however, be much faster, because most unique lines have already been removed from its input.
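To make the technique concrete, here is a minimal toy counting sketch in Rust. This is only an illustration of the general idea, not the crate's actual implementation; the type name, hashing scheme, and probe derivation are all invented for this example. Note that build and filter are separate passes, so every occurrence of a duplicated line survives, which is exactly what the final sort | uniq -d needs.

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Toy counting sketch (illustrative only): `probes` counters are bumped
// for every line added; a line is a *possible* duplicate if the minimum
// of its counters is at least 2.
struct Sketch {
    counters: Vec<u8>,
    probes: u64,
}

impl Sketch {
    fn new(size: usize, probes: u64) -> Self {
        Sketch { counters: vec![0; size], probes }
    }

    // Derive `probes` counter positions from two hashes (double hashing).
    fn positions(&self, line: &str) -> impl Iterator<Item = usize> {
        let mut h1 = DefaultHasher::new();
        line.hash(&mut h1);
        let start = h1.finish();
        let mut h2 = DefaultHasher::new();
        (line, 0x9e37_79b9u32).hash(&mut h2);
        let step = h2.finish() | 1; // odd step
        let len = self.counters.len() as u64;
        (0..self.probes).map(move |i| (start.wrapping_add(i.wrapping_mul(step)) % len) as usize)
    }

    fn add(&mut self, line: &str) {
        // Collect first so we can mutate `counters` afterwards.
        let idxs: Vec<usize> = self.positions(line).collect();
        for i in idxs {
            self.counters[i] = self.counters[i].saturating_add(1);
        }
    }

    fn may_be_duplicate(&self, line: &str) -> bool {
        // The minimum over the probed counters over-approximates the true
        // count, so genuine duplicates always pass; some unique lines may too.
        self.positions(line).map(|i| self.counters[i]).min().unwrap_or(0) >= 2
    }
}

fn main() {
    let lines = ["a", "b", "a", "c", "b"];
    let mut sketch = Sketch::new(1 << 20, 4);
    for l in &lines {
        sketch.add(l); // build pass
    }
    for l in &lines {
        if sketch.may_be_duplicate(l) {
            println!("{l}"); // filter pass: keeps both "a"s and both "b"s
        }
    }
}

Because the minimum over the probed counters can only over-count, genuine duplicates are never dropped; the price of a small sketch is only that more unique lines slip through to the final sort.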
The original zcat pipeline can be rewritten to use a sketch like this:
zcat *.gz | sketch-duplicates build > sketch
zcat *.gz | sketch-duplicates filter sketch | sort | uniq -d
Multiple sketches can be combined using sketch-duplicates combine. This can be used to parallelize the construction of the sketch (here using GNU Parallel):
printf '%s\n' *.gz | parallel 'zcat {} | sketch-duplicates build' | sketch-duplicates combine > sketch
printf '%s\n' *.gz | parallel 'zcat {} | sketch-duplicates filter sketch' | sort | uniq -d
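combine is cheap because sketches of this kind merge without revisiting the input. Under the same toy counting-sketch model as above (an assumption, not the crate's documented format), merging is just element-wise saturating addition of the counter arrays, provided both sketches share the same size and probe count:

// Illustrative merge for the toy sketch above (assumes identical size and
// probe count): element-wise saturating addition yields a sketch equivalent
// to one built over the concatenation of both inputs.
fn combine(a: &[u8], b: &[u8]) -> Vec<u8> {
    assert_eq!(a.len(), b.len(), "sketches must be the same size");
    a.iter().zip(b).map(|(x, y)| x.saturating_add(*y)).collect()
}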
Options:

-s, --size: Size of the sketch. Increasing this improves filtering accuracy but consumes more memory. This is set to a conservative default of 8 MiB and can often be increased depending on the specific use case (see the rough survival-rate estimate at the end of this page).
-p, --probes: Number of probes to perform in the sketch for each line.
-0, --zero-terminated: Use NUL bytes as line delimiters.

Install Cargo (e.g. using rustup), then run:
cargo install sketch-duplicates
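As a rough guide for tuning --size and --probes: if the sketch behaves like a classic Bloom filter over n input lines with m one-byte counters and k probes (an assumed model; the crate does not spell out its exact data structure), the fraction of unique lines that survive filtering is approximately (1 - e^(-kn/m))^k:

// Rough survival rate for unique lines under the classic Bloom-filter
// approximation (an assumed model): n = lines, m = counters, k = probes.
fn unique_survival_rate(n: f64, m: f64, k: f64) -> f64 {
    (1.0 - (-k * n / m).exp()).powf(k)
}

fn main() {
    // One million lines against the 8 MiB default (one byte per counter
    // assumed) with 4 probes: prints roughly 0.02, i.e. about 2% of
    // unique lines slip through to the final sort | uniq -d.
    println!("{:.3}", unique_survival_rate(1e6, 8.0 * 1024.0 * 1024.0, 4.0));
}

Under this model, the 8 MiB default keeps the survival rate around 2% for a million distinct lines; well beyond that, increasing --size pays off quickly.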