| Crates.io | crazy-deduper |
| lib.rs | crazy-deduper |
| version | 0.1.0 |
| created_at | 2025-08-03 22:51:33.913113+00 |
| updated_at | 2025-08-03 22:51:33.913113+00 |
| description | Deduplicates files into content-addressed chunks with selectable hash algorithms and restores them via a persistent cache. |
| homepage | |
| repository | https://github.com/FloGa/crazy-deduper |
| max_upload_size | |
| id | 1780162 |
| size | 84,311 |
Deduplicates files into content-addressed chunks with selectable hash algorithms and restores them via a persistent cache.
Crazy Deduper is a Rust tool that splits files into fixed-size chunks, identifies them using configurable hash algorithms (MD5, SHA1, SHA256, SHA512), and deduplicates redundant data into a content-addressed store. It maintains an incremental cache for speed, supports atomic cache updates, and can reverse the process (hydrate) to reconstruct original files. Optional decluttering of chunk paths and filesystem boundary awareness make it flexible for real-world workflows.
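To illustrate the general idea (not the crate's internal implementation), the following sketch splits a file into fixed-size chunks and hashes each chunk with SHA-1. The 4 MiB chunk size, the sha1 crate, and the file path are assumptions made only for this example.

// Assumed dependency: sha1 = "0.10"
use sha1::{Digest, Sha1};
use std::fs::File;
use std::io::Read;

// Hypothetical chunk size; crazy-deduper configures its own chunk size internally.
const CHUNK_SIZE: usize = 4 * 1024 * 1024;

fn hash_chunks(path: &str) -> std::io::Result<Vec<String>> {
    let mut file = File::open(path)?;
    let mut buf = vec![0u8; CHUNK_SIZE];
    let mut hashes = Vec::new();
    loop {
        // Read up to one chunk; 0 bytes means end of file. (A robust implementation
        // would keep reading until the buffer is full or EOF is reached.)
        let n = file.read(&mut buf)?;
        if n == 0 {
            break;
        }
        // Hash the chunk; identical chunks get identical, content-addressed names.
        let digest = Sha1::digest(&buf[..n]);
        hashes.push(digest.iter().map(|b| format!("{b:02x}")).collect());
    }
    Ok(hashes)
}

fn main() -> std::io::Result<()> {
    // "source/some-file.bin" is just a placeholder path.
    for hash in hash_chunks("source/some-file.bin")? {
        println!("{hash}");
    }
    Ok(())
}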
This crate is split into an Application part and a Library part.
This tool can be installed easily through Cargo via crates.io:
cargo install --locked crazy-deduper
Please note that the --locked flag is necessary here to get exactly the same dependency versions that were used when the application was tagged and tested. Without it, you might get more up-to-date dependencies, but you risk undefined and unexpected behavior if those dependencies changed some of their functionality. The application might even fail to build if the public API of a dependency changed too much.
Alternatively, pre-built binaries can be downloaded from the GitHub releases page.
Usage: crazy-deduper [OPTIONS] <SOURCE> <TARGET>
Arguments:
<SOURCE>
Source directory
<TARGET>
Target directory
Options:
--cache-file <CACHE_FILE>
Path to cache file
Can be used multiple times. The files are read in reverse order, so they should be sorted with the most accurate ones in the beginning. The first given will be written.
--hashing-algorithm <HASHING_ALGORITHM>
Hashing algorithm to use for chunk filenames
[default: sha1]
[possible values: md5, sha1, sha256, sha512]
--same-file-system
Limit file listing to same file system
--declutter-levels <DECLUTTER_LEVELS>
Declutter files into this many subdirectory levels
[default: 0]
-d, --decode
Invert behavior, restore tree from deduplicated data
[aliases: --hydrate]
-h, --help
Print help (see a summary with '-h')
-V, --version
Print version
To create a deduped version of the source directory in deduped, you can use:
crazy-deduper --declutter-levels 3 --cache-file cache.json.zst source deduped
If the cache file ends with .zst, it will be encoded (or decoded in the case of hydrating) using the ZSTD compression
algorithm. For any other extension, plain JSON will be used.
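As a rough sketch of how such an extension-based choice can be made (assuming the zstd and serde_json crates; this is not necessarily how the crate does it internally):

// Assumed dependencies: serde_json = "1", zstd = "0.13"
use std::error::Error;
use std::fs::File;
use std::path::Path;

fn write_cache_file(path: &Path, cache: &serde_json::Value) -> Result<(), Box<dyn Error>> {
    let file = File::create(path)?;
    if path.extension().map_or(false, |ext| ext == "zst") {
        // ".zst" extension: serialize through a ZSTD encoder (level 0 = zstd default).
        let mut encoder = zstd::Encoder::new(file, 0)?;
        serde_json::to_writer(&mut encoder, cache)?;
        encoder.finish()?;
    } else {
        // Any other extension: plain, uncompressed JSON.
        serde_json::to_writer(file, cache)?;
    }
    Ok(())
}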
To restore (hydrate) the directory again into the directory hydrated, you can use:
crazy-deduper --declutter-levels 3 --cache-file cache.json.zst deduped hydrated
Please note that, for now, you need to specify the same decluttering level that you used when deduping the source directory. This limitation will be lifted in a future version.
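To give an intuition for what decluttering means, here is a hypothetical sketch that nests each chunk under two-character subdirectories derived from its hash; the exact layout used by crazy-deduper may differ.

use std::path::PathBuf;

// Hypothetical sketch: nest a chunk file under `levels` two-character subdirectories
// derived from its hash. The actual layout used by crazy-deduper may differ.
fn decluttered_path(hash: &str, levels: usize) -> PathBuf {
    let mut path = PathBuf::new();
    for i in 0..levels {
        path.push(&hash[2 * i..2 * i + 2]);
    }
    path.push(hash);
    path
}

fn main() {
    // With 3 levels, a hash like "d4c3b2a1e5f6" would land in "d4/c3/b2/d4c3b2a1e5f6".
    println!("{}", decluttered_path("d4c3b2a1e5f6", 3).display());
}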
The cache file is necessary to keep track of all file chunks and hashes. Without the cache you would not be able to restore your files.
The cache file can be re-used even if the source directory has changed. It keeps track of file sizes and modification times and only re-hashes new or changed files; deleted files are removed from the cache.
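A minimal sketch of such a size-and-mtime check might look like this (the CacheEntry type and the decision logic are illustrative assumptions, not the crate's actual cache format):

use std::fs;
use std::path::Path;
use std::time::SystemTime;

// Hypothetical cache entry for illustration; the real cache format is not shown here.
struct CacheEntry {
    size: u64,
    mtime: SystemTime,
}

// A file is re-hashed only if it is new or its size or modification time changed.
fn needs_rehash(path: &Path, cached: Option<&CacheEntry>) -> std::io::Result<bool> {
    let meta = fs::metadata(path)?;
    Ok(match cached {
        Some(entry) => entry.size != meta.len() || entry.mtime != meta.modified()?,
        None => true, // not in the cache yet
    })
}

fn main() -> std::io::Result<()> {
    // Without a cached entry, the file is always considered new.
    println!("{}", needs_rehash(Path::new("source/some-file.bin"), None)?);
    Ok(())
}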
You can also use older cache files in addition to a new one:
crazy-deduper --cache-file cache.json.zst --cache-file cache-from-yesterday.json.zst source deduped
The cache files are read in the reverse order in which they are given on the command line, so the content of earlier cache files is preferred over that of later ones. Hence, you should put your most accurate cache files at the beginning. Moreover, the first cache file given is the one that will be written to; it does not need to exist.
In the given example, if cache.json.zst does not exist, the internal cache is pre-filled from
cache-from-yesterday.json.zst so that only new and modified files need to be re-hashed. The result is then written
into cache.json.zst.
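The layering can be pictured as a simple map merge applied in reverse order, so that cache files given earlier on the command line win (a hypothetical sketch, not the crate's internal data structures):

use std::collections::HashMap;

// Apply the caches in reverse order so that entries from cache files given earlier
// on the command line overwrite entries from later ones.
fn merge_caches(caches: Vec<HashMap<String, String>>) -> HashMap<String, String> {
    let mut merged = HashMap::new();
    for cache in caches.into_iter().rev() {
        merged.extend(cache);
    }
    merged
}

fn main() {
    let today = HashMap::from([("a.txt".to_string(), "hash-today".to_string())]);
    let yesterday = HashMap::from([
        ("a.txt".to_string(), "hash-yesterday".to_string()),
        ("b.txt".to_string(), "hash-b".to_string()),
    ]);
    // Order as on the command line: most accurate (today) first.
    let merged = merge_caches(vec![today, yesterday]);
    // "a.txt" keeps the value from the first (most accurate) cache.
    assert_eq!(merged["a.txt"], "hash-today");
    assert_eq!(merged["b.txt"], "hash-b");
}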
To add the crazy-deduper library to your project, you can use:
cargo add crazy-deduper
The following is a short summary of how this library is intended to be used.
This is an example of how to re-create the main functionality of the Application.
fn main() {
    // Deduplicate
    let mut deduper = crazy_deduper::Deduper::new(
        "source",
        vec!["cache.json.zst"],
        crazy_deduper::HashingAlgorithm::MD5,
        true,
    );
    deduper.write_chunks("deduped", 3).unwrap();
    deduper.write_cache();

    // Hydrate again
    let hydrator = crazy_deduper::Hydrator::new("deduped", vec!["cache.json.zst"]);
    hydrator.restore_files("hydrated", 3);
}
This method can be used if you want to implement your own logic and you only need the chunk objects.
fn main() {
    let deduper = crazy_deduper::Deduper::new(
        "source",
        vec!["cache.json.zst"],
        crazy_deduper::HashingAlgorithm::MD5,
        true,
    );

    for (hash, chunk, dirty) in deduper.cache.get_chunks().unwrap() {
        // Chunks and hashes are calculated on the fly, so you don't need to wait for the whole
        // directory tree to be hashed.
        println!("{hash:?}: {chunk:?}");

        if dirty {
            // This is just a simple example. Please do not write after every hash calculation, the
            // IO overhead will slow things down dramatically. You should write only every 10
            // seconds or so. Please be aware that you can kill the execution at any time. Since
            // the cache will be written atomically and re-used on subsequent calls, you can
            // terminate and resume at any point.
            deduper.write_cache();
        }
    }
}
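Building on the example above, a time-based throttle is one way to follow the "every 10 seconds or so" advice; the interval and the overall structure are just one possible approach, not the only correct one.

use std::time::{Duration, Instant};

fn main() {
    let deduper = crazy_deduper::Deduper::new(
        "source",
        vec!["cache.json.zst"],
        crazy_deduper::HashingAlgorithm::MD5,
        true,
    );

    // Write the cache at most every 10 seconds instead of after every chunk.
    let interval = Duration::from_secs(10);
    let mut last_write = Instant::now();

    for (hash, chunk, dirty) in deduper.cache.get_chunks().unwrap() {
        println!("{hash:?}: {chunk:?}");

        if dirty && last_write.elapsed() >= interval {
            deduper.write_cache();
            last_write = Instant::now();
        }
    }

    // Persist whatever is still pending after the loop finishes.
    deduper.write_cache();
}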