# Deduplicator
Find, Sort, Filter & Delete duplicate files
## Usage
```bash
Usage: deduplicator [OPTIONS] [scan_dir_path]
Arguments:
  [scan_dir_path]  Directory to scan, if different from the current directory (e.g., ~/Pictures)

Options:
  -t, --types         Filetypes to deduplicate [default: all]
  -i, --interactive   Delete files interactively
  -s, --min-size      Minimum filesize of duplicates to scan (e.g., 100B/1K/2M/3G/4T) [default: 1b]
  -d, --max-depth     Max depth to scan while looking for duplicates
      --min-depth     Min depth to scan while looking for duplicates
  -f, --follow-links  Follow symbolic links while scanning directories
  -h, --help          Print help information
  -V, --version       Print version information
      --json          Print output as JSON
```
### Examples
```bash
# Scan recursively from the current dir, look only for png, jpg & pdf file types, and delete duplicates interactively
deduplicator -t pdf,jpg,png -i

# Scan recursively from the ~/Pictures dir, look only for png, jpeg, jpg & pdf file types, and delete duplicates interactively
deduplicator ~/Pictures/ -t png,jpeg,jpg,pdf -i

# Scan for duplicates in ~/Pictures without recursing into subdirectories
deduplicator ~/Pictures --max-depth 0

# Look for duplicates in the ~/.config directory while also following symbolic links
deduplicator ~/.config --follow-links

# Scan for duplicates larger than 100MB in the ~/Media directory
deduplicator ~/Media --min-size 100mb
```
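The usage listing also includes a `--json` flag. Assuming it prints the duplicate report as JSON on stdout (the exact schema isn't documented here), the results can be saved for later processing:

```bash
# Capture the duplicate report for scripting; inspect the JSON shape
# before building anything against it, since the schema is an assumption here.
deduplicator ~/Documents --json > duplicates.json
```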
## Installation
### Cargo Install
#### Stable
```bash
$ cargo install deduplicator
```
#### Nightly
If you'd like to install with nightly features, you can use:
```bash
$ cargo install --git https://github.com/sreedevk/deduplicator
```
Please note that if you use a version manager to install rust (like asdf), you need to reshim (`asdf reshim rust`).
### Linux (Pre-built Binary)
You can download the pre-built binary from the [Releases](https://github.com/sreedevk/deduplicator/releases) page.
Download the `deduplicator-x86_64-unknown-linux-gnu.tar.gz` tarball for Linux. Once you have the tarball with the executable,
follow these steps to install:
```bash
$ tar -zxvf deduplicator-x86_64-unknown-linux-gnu.tar.gz
$ sudo mv deduplicator /usr/bin/
```
### macOS (Pre-built Binary)
You can download the pre-built binary from the [Releases](https://github.com/sreedevk/deduplicator/releases) page.
Download the `deduplicator-x86_64-apple-darwin.tar.gz` tarball for macOS. Once you have the tarball with the executable, follow these steps to install:
```bash
$ tar -zxvf deduplicator-x86_64-apple-darwin.tar.gz
$ sudo mv deduplicator /usr/local/bin/
```
### Windows (Pre-built Binary)
You can download the pre-built binary from the [Releases](https://github.com/sreedevk/deduplicator/releases) page.
Download the `deduplicator-x86_64-pc-windows-msvc.zip` file for Windows, unzip it, and move `deduplicator.exe` to a location in the `PATH` system environment variable.
Note: If you run into an MSVC error, please install the MSVC redistributable from [here](https://learn.microsoft.com/en-us/cpp/windows/latest-supported-vc-redist?view=msvc-170)
## Performance
Deduplicator uses size comparison and fxhash (a fast, non-cryptographic hashing algorithm) to quickly scan through large numbers of files and find duplicates. It is also highly parallel (it uses rayon and dashmap). I was able to scan through 120GB of files (videos, PDFs, images) in ~300ms. Check out the benchmarks below.
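As a rough illustration of why this approach is fast, here is a minimal sketch of the size-then-hash technique using the rayon, dashmap, and fxhash crates named above. It is not deduplicator's actual source; the `find_duplicates` function and its signature are hypothetical:

```rust
use dashmap::DashMap;
use fxhash::hash64;
use rayon::prelude::*;
use std::{fs, path::PathBuf};

/// Hypothetical helper: group `paths` into sets of files with identical content.
fn find_duplicates(paths: Vec<PathBuf>) -> Vec<Vec<PathBuf>> {
    // Pass 1: bucket files by size in parallel. A file with a unique size
    // cannot have a duplicate, so it is dropped before any bytes are read.
    let by_size: DashMap<u64, Vec<PathBuf>> = DashMap::new();
    paths.into_par_iter().for_each(|path| {
        if let Ok(meta) = fs::metadata(&path) {
            by_size.entry(meta.len()).or_default().push(path);
        }
    });

    // Pass 2: hash only the size-colliding files, in parallel, with fxhash.
    let candidates: Vec<PathBuf> = by_size
        .into_iter()
        .filter(|(_, group)| group.len() > 1)
        .flat_map(|(_, group)| group)
        .collect();
    let by_hash: DashMap<u64, Vec<PathBuf>> = DashMap::new();
    candidates.into_par_iter().for_each(|path| {
        if let Ok(bytes) = fs::read(&path) {
            by_hash.entry(hash64(&bytes)).or_default().push(path);
        }
    });

    // Keep only the hash buckets that actually contain more than one file.
    // A production tool would typically verify byte-for-byte on collisions,
    // since fxhash is non-cryptographic.
    by_hash
        .into_iter()
        .map(|(_, group)| group)
        .filter(|group| group.len() > 1)
        .collect()
}
```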
## Benchmarks
| Command | Dirsize | Filecount | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|:---|---:|---:|---:|---:|---:|
| `deduplicator ~/Data/tmp` | (~120G) | 721 files | 33.5 ± 28.6 | 25.3 | 151.5 | 1.87 ± 1.60 |
| `deduplicator ~/Data/books` | (~8.6G) | 1419 files | 24.5 ± 1.0 | 22.9 | 28.1 | 1.37 ± 0.08 |
| `deduplicator ~/Data/books --min-size 10M` | (~8.6G) | 1419 files | 17.9 ± 0.7 | 16.8 | 20.0 | 1.00 |
| `deduplicator ~/Data/ --types pdf,jpg,png,jpeg` | (~290G) | 104222 files | 1207.2 ± 37.0 | 1172.2 | 1287.7 | 67.27 ± 3.33 |
* The last entry is slower because of the number of files deduplicator had to go through (~660895 files). The average size of the files rarely affects deduplicator's performance.
These benchmarks were run using [hyperfine](https://github.com/sharkdp/hyperfine); a sample invocation is sketched after the specs below. Here are the specs of the machine used to benchmark deduplicator:
```
OS: Arch Linux x86_64
Host: Precision 5540
Kernel: 5.15.89-1-lts
Uptime: 4 hours, 44 mins
Shell: zsh 5.9
Terminal: kitty
CPU: Intel i9-9880H (16) @ 4.800GHz
GPU: NVIDIA Quadro T2000 Mobile / Max-Q
GPU: Intel CoffeeLake-H GT2 [UHD Graphics 630]
Memory: 31731MiB (~32GiB)
```
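A reproduction along these lines is possible with hyperfine; the exact flags behind the published numbers aren't documented, so the warmup count and paths below are assumptions:

```bash
# Hypothetical benchmark invocation mirroring the table above.
hyperfine --warmup 3 \
  'deduplicator ~/Data/tmp' \
  'deduplicator ~/Data/books' \
  'deduplicator ~/Data/books --min-size 10M' \
  'deduplicator ~/Data/ --types pdf,jpg,png,jpeg'
```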
## Screenshots
![](https://user-images.githubusercontent.com/36154121/213618143-e5182e39-731e-4817-87dd-1a6a0f38a449.gif)
## Roadmap
- Tree format output for duplicate file listing
- GUI
- Packages for different operating system repositories (currently installable only via cargo or the pre-built binaries)