Crates.io | find-identical-files |
lib.rs | find-identical-files |
version | 0.33.1 |
source | src |
created_at | 2024-04-11 13:45:39.976638 |
updated_at | 2024-09-27 12:27:38.503048 |
description | find identical files according to their size and hashing algorithm |
homepage | https://github.com/claudiofsr/find-identical-files |
repository | https://github.com/claudiofsr/find-identical-files |
max_upload_size | |
id | 1204939 |
size | 148,647 |
Find identical files according to their size and hashing algorithm.
That is, two files are considered identical if they have the same size and the same hash.
"A hash function is a mathematical algorithm that takes an input (in this case, a file) and produces a fixed-size string of characters, known as a hash value or checksum. The hash value acts as a summary representation of the original input. This hash value is unique (disregarding unlikely collisions) to the input data, meaning even a slight change in the input will result in a completely different hash value."
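The property described above can be seen in a few lines of Rust. This sketch uses the standard library's DefaultHasher purely as a stand-in; find-identical-files itself offers ahash, blake3, fxhash, sha256 and sha512, and the helper name `hash_of` is illustrative, not part of the crate's API.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Stand-in hash function; the real tool uses ahash, blake3,
// fxhash, sha256 or sha512 instead of std's DefaultHasher.
fn hash_of(data: &[u8]) -> u64 {
    let mut hasher = DefaultHasher::new();
    data.hash(&mut hasher);
    hasher.finish()
}

fn main() {
    let a = hash_of(b"hello world");
    let b = hash_of(b"hello world!"); // one extra byte
    // Even a slight change in the input yields a different hash value.
    assert_ne!(a, b);
    // The same input always yields the same hash value.
    assert_eq!(a, hash_of(b"hello world"));
    println!("{:016x} != {:016x}", a, b);
}
```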
To find identical files, 3 procedures are performed:

Procedure 1. Group files by size.

Procedure 2. Group files by hash(first_bytes) with the ahash algorithm.

Procedure 3. Group files by hash(entire_file) with the chosen algorithm.
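The three procedures can be sketched as follows. This is an in-memory illustration, not the crate's implementation: it groups named byte buffers instead of files on disk, uses std's DefaultHasher where the tool would use ahash (first bytes) and the chosen algorithm (entire file), and the 64-byte prefix length and the helper names are assumptions.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

// Stand-in for the tool's hash algorithms (ahash, blake3, ...).
fn hash_bytes(data: &[u8]) -> u64 {
    let mut hasher = DefaultHasher::new();
    data.hash(&mut hasher);
    hasher.finish()
}

// Returns groups of names whose contents are identical.
fn group_identical<'a>(files: &[(&'a str, &'a [u8])]) -> Vec<Vec<&'a str>> {
    // Procedure 1: group files by size.
    let mut by_size: HashMap<usize, Vec<(&'a str, &'a [u8])>> = HashMap::new();
    for &(name, data) in files {
        by_size.entry(data.len()).or_default().push((name, data));
    }
    let mut groups: Vec<Vec<&'a str>> = Vec::new();
    for (_size, candidates) in by_size {
        if candidates.len() < 2 {
            continue; // a unique size means a unique file
        }
        // Procedure 2: group by hash of the first bytes (cheap filter;
        // the 64-byte prefix is an arbitrary choice for this sketch).
        // Procedure 3: group by hash of the entire contents.
        let mut by_hash: HashMap<(u64, u64), Vec<&'a str>> = HashMap::new();
        for (name, data) in candidates {
            let first_bytes = &data[..data.len().min(64)];
            by_hash
                .entry((hash_bytes(first_bytes), hash_bytes(data)))
                .or_default()
                .push(name);
        }
        for (_key, names) in by_hash {
            if names.len() >= 2 {
                groups.push(names);
            }
        }
    }
    groups
}

fn main() {
    let files = [
        ("a.txt", &b"hello"[..]),
        ("b.txt", &b"hello"[..]),
        ("c.txt", &b"world"[..]), // same size, different contents
    ];
    let groups = group_identical(&files);
    assert_eq!(groups, vec![vec!["a.txt", "b.txt"]]);
    println!("{groups:?}");
}
```

The size pass is what makes the approach cheap: the entire file is only hashed when at least two files already share a size and a prefix hash.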
Hash algorithm options are: ahash, blake3 (default), fxhash, sha256 and sha512.
find-identical-files just reads the files and never changes their contents. See the open_file function to verify.
To search for identical files in the current directory (the default input directory), run:

find-identical-files

The number of identical files is the number of times the same file is found (its number of repetitions, or frequency).
By default, identical files are filtered so that only those whose frequency is two (duplicates) or more are selected.
find-identical-files -f N

such that N is an integer greater than or equal to 1 (N >= 1).

With the -f (or --min_frequency) argument option, set the minimum frequency (number of identical files).
With the -F (or --max_frequency) argument option, set the maximum frequency (number of identical files).
To get hash information for all files in the current directory, set the minimum frequency to 1:

find-identical-files -f 1
To report only duplicate files (the default behavior):

find-identical-files

or

find-identical-files -f 2

To report only files that occur exactly four times:

find-identical-files -f 4 -F 4
find-identical-files -b N

such that N is an integer (N >= 0).

With the -b (or --min_size) argument option, set the minimum size (in bytes).
With the -B (or --max_size) argument option, set the maximum size (in bytes).
To consider only files of at least 8 bytes:

find-identical-files -b 8

To consider only files of at most 1024 bytes:

find-identical-files -B 1024

To consider only files between 8 and 1024 bytes:

find-identical-files -b 8 -B 1024

To consider only files of exactly 1024 bytes:

find-identical-files -b 1024 -B 1024
To use the fxhash algorithm and print the result in yaml format:

find-identical-files -twa fxhash -r yaml
To export the result to a CSV file (fif.csv) in the current directory:

find-identical-files -c .

To export the result to a CSV file in the /tmp directory:

find-identical-files -c /tmp

or

find-identical-files --csv_dir=/tmp
To export the result to an XLSX file (fif.xlsx) in the ~/Downloads directory:

find-identical-files -x ~/Downloads

To export the result to an XLSX file in the /tmp directory:

find-identical-files -x /tmp

or

find-identical-files --xlsx_dir=/tmp
To search the ~/Downloads directory with the ahash algorithm, redirect the output to a json file (/tmp/fif.json) and export the result to an XLSX file (/tmp/fif.xlsx) for further analysis:

find-identical-files -tvi ~/Downloads -a ahash -x /tmp -r json > /tmp/fif.json
To extract only the hash of each group of identical files:

find-identical-files -r json | jq -sr '.[:-1].[].["File information"].hash'
To see the first group of identical files:

find-identical-files -r json | jq -s '.[0]'

To see the fifteenth group (index 14):

find-identical-files -r json | jq -s '.[14]'

To see the groups from index a (inclusive) to index b (exclusive), for a = 2 and b = 5:

find-identical-files -r json | jq -s '.[2:5]'

To see the last element (the summary):

find-identical-files -r json | jq -s '.[-1]'
Another option is to redirect the result to a temporary file and read specific information:
find-identical-files -vr json > /tmp/fif
jq -sr '.[:-1].[].["File information"].hash' /tmp/fif
jq -s '.[0]' /tmp/fif
jq -s '.[-2]' /tmp/fif
jq -s '.[-1]' /tmp/fif
jq -s '.[-1]["Total number of identical files"]' /tmp/fif
Type find-identical-files -h in the terminal to see the help messages and all available options:
find identical files according to their size and hashing algorithm
Usage: find-identical-files [OPTIONS]
Options:
-a, --algorithm <ALGORITHM>
Choose the hash algorithm [default: blake3] [possible values: ahash, blake3, fxhash, sha256, sha512]
-b, --min_size <MIN_SIZE>
Set a minimum file size (in bytes) to search for identical files [default: 0]
-B, --max_size <MAX_SIZE>
Set a maximum file size (in bytes) to search for identical files
-c, --csv_dir <CSV_DIR>
Set the output directory for the CSV file (fif.csv)
-d, --min_depth <MIN_DEPTH>
Set the minimum depth to search for identical files [default: 0]
-D, --max_depth <MAX_DEPTH>
Set the maximum depth to search for identical files
-e, --extended_path
Prints extended path of identical files, otherwise relative path
-f, --min_frequency <MIN_FREQUENCY>
Minimum frequency (number of identical files) to be filtered [default: 2]
-F, --max_frequency <MAX_FREQUENCY>
Maximum frequency (number of identical files) to be filtered
-g, --generate <GENERATOR>
If provided, outputs the completion file for given shell [possible values: bash, elvish, fish, powershell, zsh]
-i, --input_dir <INPUT_DIR>
Set the input directory where to search for identical files [default: current directory]
-o, --omit_hidden
Omit hidden files (starts with '.'), otherwise search all files
-r, --result_format <RESULT_FORMAT>
Print the result in the chosen format [default: personal] [possible values: json, yaml, personal]
-s, --sort
Sort result by number of identical files, otherwise sort by file size
-t, --time
Show total execution time
-v, --verbose
Show intermediate runtime messages
-w, --wipe_terminal
Wipe (Clear) the terminal screen before listing the identical files
-x, --xlsx_dir <XLSX_DIR>
Set the output directory for the XLSX file (fif.xlsx)
-h, --help
Print help (see more with '--help')
-V, --version
Print version
To build and install from source, run the following command:
cargo install find-identical-files
Another option is to install from github:
cargo install --git https://github.com/claudiofsr/find-identical-files.git
In general, jwalk (default) is faster than walkdir.
But if you prefer to use walkdir:
cargo install --features walkdir find-identical-files