| Crates.io | deduplicator |
| lib.rs | deduplicator |
| version | 0.3.1 |
| created_at | 2023-01-03 04:54:04.37238+00 |
| updated_at | 2025-07-19 13:46:18.038026+00 |
| description | find,filter and delete duplicate files |
| homepage | |
| repository | https://github.com/sreedevk/deduplicator |
| max_upload_size | |
| id | 749909 |
| size | 92,525 |
Find, Sort, Filter & Delete duplicate files
Usage: deduplicator [OPTIONS] [scan_dir_path]
Arguments:
[scan_dir_path] Run Deduplicator on dir different from pwd (e.g., ~/Pictures )
Options:
-T, --exclude-types <EXCLUDE_TYPES> Exclude Filetypes [default = none]
-t, --types <TYPES> Filetypes to deduplicate [default = all]
-i, --interactive Delete files interactively
-m, --min-size <MIN_SIZE> Minimum filesize of duplicates to scan (e.g., 100B/1K/2M/3G/4T) [default: 1b]
-D, --max-depth <MAX_DEPTH> Max Depth to scan while looking for duplicates
-d, --min-depth <MIN_DEPTH> Min Depth to scan while looking for duplicates
-f, --follow-links Follow links while scanning directories
-s, --strict Guarantees that two files are duplicate (performs a full hash)
-p, --progress Show Progress spinners & metrics
-h, --help Print help
-V, --version Print version
# Scan for duplicates recursively from the current dir, only look for png, jpg & pdf file types & interactively delete files
deduplicator -t pdf,jpg,png -i
# Scan for duplicates recursively from the current dir, excluding png and jpg file types
deduplicator -T jpg,png
# Scan for duplicates recursively from the ~/Pictures dir, only look for png, jpeg, jpg & pdf file types & interactively delete files
deduplicator ~/Pictures/ -t png,jpeg,jpg,pdf -i
# Scan for duplicates in the ~/Pictures dir without recursing into subdirectories
deduplicator ~/Pictures --max-depth 0
# Look for duplicates in the ~/.config directory while also recursing into symbolic link paths
deduplicator ~/.config --follow-links
# Scan for duplicates larger than 100mb in the ~/Media directory
deduplicator ~/Media --min-size 100mb
Currently, deduplicator can only be installed using the cargo package manager.
GxHash relies on AES hardware acceleration, so please set RUSTFLAGS to "-C target-feature=+aes" or "-C target-cpu=native" before installing.
# install the latest release from crates.io
$ RUSTFLAGS="-C target-cpu=native" cargo install deduplicator
# or
$ RUSTFLAGS="-C target-feature=+aes,+sse2" cargo install deduplicator
# install the latest development version from git
$ RUSTFLAGS="-C target-cpu=native" cargo install deduplicator --git https://github.com/sreedevk/deduplicator
# or
$ RUSTFLAGS="-C target-feature=+aes,+sse2" cargo install --git https://github.com/sreedevk/deduplicator
Deduplicator uses size comparison and GxHash to quickly check large numbers of files for duplicates, and the work is heavily parallelized. By default, deduplicator only hashes the first page (4K) of each file, so performance is the default priority. You can change this behavior with the --strict flag, which hashes the whole file and guarantees that two files are indeed duplicates. I'll add benchmarks in future versions.
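As a rough illustration of that two-stage approach (group files by size first, then by a hash of the first page), here is a minimal sketch, not the crate's actual implementation: it uses the standard library's DefaultHasher as a stand-in for GxHash and leaves out parallelism, type filtering, and the --strict full-hash pass.

```rust
use std::collections::{hash_map::DefaultHasher, HashMap};
use std::fs::{self, File};
use std::hash::{Hash, Hasher};
use std::io::{self, Read};
use std::path::{Path, PathBuf};

const PAGE: u64 = 4096; // only the first page of each file is hashed by default

// Hash the first 4K of a file (DefaultHasher stands in for GxHash here).
fn partial_hash(path: &Path) -> io::Result<u64> {
    let mut head = Vec::with_capacity(PAGE as usize);
    File::open(path)?.take(PAGE).read_to_end(&mut head)?;
    let mut hasher = DefaultHasher::new();
    head.hash(&mut hasher);
    Ok(hasher.finish())
}

// Group candidate paths by size, then by partial hash; only groups with more
// than one member are possible duplicate sets.
fn find_duplicates(paths: Vec<PathBuf>) -> io::Result<Vec<Vec<PathBuf>>> {
    let mut by_size: HashMap<u64, Vec<PathBuf>> = HashMap::new();
    for path in paths {
        by_size.entry(fs::metadata(&path)?.len()).or_default().push(path);
    }

    let mut duplicate_sets = Vec::new();
    for bucket in by_size.into_values().filter(|b| b.len() > 1) {
        let mut by_hash: HashMap<u64, Vec<PathBuf>> = HashMap::new();
        for path in bucket {
            by_hash.entry(partial_hash(&path)?).or_default().push(path);
        }
        duplicate_sets.extend(by_hash.into_values().filter(|g| g.len() > 1));
    }
    Ok(duplicate_sets)
}
```

The size pass costs only one metadata call per file, so the much more expensive hashing work runs only inside size buckets that could actually contain duplicates.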
I've used hyperfine to run deduplicator on files generated by the Rake file at rakelib/benchmark.rake. Benchmarking accuracy could be further improved by isolating runs inside restricted Docker containers; I'll include that in the future. For now, here's the hyperfine output on my i7-12800H laptop with 32G of RAM.
# hyperfine -N --warmup 80 './target/release/deduplicator bench_artifacts'
Benchmark 1: ./target/release/deduplicator bench_artifacts
Time (mean ± σ): 2.2 ms ± 0.4 ms [User: 2.2 ms, System: 4.4 ms]
Range (min … max): 1.3 ms … 7.1 ms 1522 runs
dust 'bench_artifacts'
54M ┌── file_0_fwds.bin │████ │ 2%
122M ├── file_1_fwds.bin │████████ │ 5%
390M ├── file_0_fwdcbss.bin│██████████████████████████ │ 15%
390M ├── file_0_fwscas.bin │██████████████████████████ │ 15%
390M ├── file_0_fwss.bin │██████████████████████████ │ 15%
390M ├── file_1_fwdcbss.bin│██████████████████████████ │ 15%
390M ├── file_1_fwscas.bin │██████████████████████████ │ 15%
390M ├── file_1_fwss.bin │██████████████████████████ │ 15%
2.5G ┌─┴ bench_artifacts │██████████████████████████████████████████████████████████████████ │ 100%
# hyperfine --warmup 20 './target/release/deduplicator bench_artifacts'
Benchmark 1: ./target/release/deduplicator bench_artifacts
Time (mean ± σ): 40.1 ms ± 2.3 ms [User: 251.0 ms, System: 277.3 ms]
Range (min … max): 35.0 ms … 45.9 ms 72 runs
dust 'bench_artifacts'
3.9M ┌── file_992_fwscas.bin │█ │ 0%
3.9M ├── file_992_fwss.bin │█ │ 0%
3.9M ├── file_993_fwdcbss.bin│█ │ 0%
3.9M ├── file_993_fwscas.bin │█ │ 0%
3.9M ├── file_993_fwss.bin │█ │ 0%
3.9M ├── file_994_fwdcbss.bin│█ │ 0%
3.9M ├── file_994_fwscas.bin │█ │ 0%
3.9M ├── file_994_fwss.bin │█ │ 0%
3.9M ├── file_995_fwdcbss.bin│█ │ 0%
3.9M ├── file_995_fwscas.bin │█ │ 0%
3.9M ├── file_995_fwss.bin │█ │ 0%
3.9M ├── file_996_fwdcbss.bin│█ │ 0%
3.9M ├── file_996_fwscas.bin │█ │ 0%
3.9M ├── file_996_fwss.bin │█ │ 0%
3.9M ├── file_997_fwdcbss.bin│█ │ 0%
3.9M ├── file_997_fwscas.bin │█ │ 0%
3.9M ├── file_997_fwss.bin │█ │ 0%
3.9M ├── file_998_fwdcbss.bin│█ │ 0%
3.9M ├── file_998_fwscas.bin │█ │ 0%
3.9M ├── file_998_fwss.bin │█ │ 0%
3.9M ├── file_999_fwdcbss.bin│█ │ 0%
3.9M ├── file_999_fwscas.bin │█ │ 0%
3.9M ├── file_999_fwss.bin │█ │ 0%
3.9M ├── file_99_fwdcbss.bin │█ │ 0%
3.9M ├── file_99_fwscas.bin │█ │ 0%
3.9M ├── file_99_fwss.bin │█ │ 0%
3.9M ├── file_9_fwdcbss.bin │█ │ 0%
3.9M ├── file_9_fwscas.bin │█ │ 0%
3.9M ├── file_9_fwss.bin │█ │ 0%
11G ┌─┴ bench_artifacts │████████████████████████████████████████████████████████████████ │ 100%
parallelization
max file path size should use the last set of duplicates
add more unit tests
restore json output (was removed in 0.3 due to quality issues)
fix memory leak on very large filesystems
tui
change the default hashing method to hash both the first & last page of a file (8K); see the sketch after this list
provide option to localize duplicate detection to arbitrary levels relative to current directory
bulk operations
fix: partial hash collision between a file full of null bytes ("\0") and an empty file. This is a known trade-off in gxhash.
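For the roadmap item about hashing the first & last page of a file, here is a hypothetical sketch of what that could look like (again with DefaultHasher standing in for GxHash; the eventual implementation may differ):

```rust
use std::collections::hash_map::DefaultHasher;
use std::fs::File;
use std::hash::{Hash, Hasher};
use std::io::{self, Read, Seek, SeekFrom};
use std::path::Path;

const PAGE: u64 = 4096;

// Hypothetical sketch of the proposed default: hash the first page plus the
// last page of the file (up to 8K total) instead of only the first page.
fn head_tail_hash(path: &Path) -> io::Result<u64> {
    let mut file = File::open(path)?;
    let len = file.metadata()?.len();
    let mut hasher = DefaultHasher::new();

    // First page.
    let mut head = Vec::with_capacity(PAGE as usize);
    file.by_ref().take(PAGE).read_to_end(&mut head)?;
    head.hash(&mut hasher);

    // Last page, skipping any bytes already covered by the head.
    if len > PAGE {
        let start = len.saturating_sub(PAGE).max(PAGE);
        file.seek(SeekFrom::Start(start))?;
        let mut tail = Vec::with_capacity(PAGE as usize);
        file.read_to_end(&mut tail)?;
        tail.hash(&mut hasher);
    }
    Ok(hasher.finish())
}
```

Covering the tail as well would let the partial hash distinguish large files that share a common prefix, at the cost of one extra seek and read per file.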