Crazy Deduper


Deduplicates files into content-addressed chunks with selectable hash algorithms and restores them via a persistent cache.

Crazy Deduper is a Rust tool that splits files into fixed-size chunks, identifies them using configurable hash algorithms (MD5, SHA1, SHA256, SHA512), and deduplicates redundant data into a content-addressed store. It maintains an incremental cache for speed, supports atomic cache updates, and can reverse the process (hydrate) to reconstruct original files. Optional decluttering of chunk paths and filesystem boundary awareness make it flexible for real-world workflows.
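
To illustrate the core idea, here is a minimal, self-contained sketch of fixed-size chunking with content addressing. The 4 MiB chunk size, the file name, and the use of the sha1 crate are assumptions for illustration only, not crazy-deduper's actual internals:

use std::fs::File;
use std::io::Read;

use sha1::{Digest, Sha1};

// Fill `buf` as far as possible; returns the number of bytes read.
// Needed because a single `read` call may return less than a full chunk.
fn read_chunk(reader: &mut impl Read, buf: &mut [u8]) -> std::io::Result<usize> {
    let mut filled = 0;
    while filled < buf.len() {
        let n = reader.read(&mut buf[filled..])?;
        if n == 0 {
            break; // EOF
        }
        filled += n;
    }
    Ok(filled)
}

fn main() -> std::io::Result<()> {
    // Hypothetical chunk size; the real size is an implementation detail.
    const CHUNK_SIZE: usize = 4 * 1024 * 1024;

    let mut file = File::open("some-file.bin")?;
    let mut buf = vec![0u8; CHUNK_SIZE];

    loop {
        let n = read_chunk(&mut file, &mut buf)?;
        if n == 0 {
            break;
        }
        // The chunk's hash doubles as its name in the chunk store:
        // identical chunks get identical names and are stored only once.
        let hash = Sha1::digest(&buf[..n]);
        let hex: String = hash.iter().map(|b| format!("{:02x}", b)).collect();
        println!("{hex}");
    }
    Ok(())
}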

This crate is split into an Application part and a Library part.

Application

Installation

This tool can be installed easily through Cargo via crates.io:

cargo install --locked crazy-deduper

Please note that the --locked flag is necessary here to get the exact same dependency versions that were used when the application was tagged and tested. Without it, you might get more up-to-date dependencies, but you also risk undefined and unexpected behavior if those dependencies changed functionality. The application might even fail to build if the public API of a dependency changed too much.

Alternatively, pre-built binaries can be downloaded from the GitHub releases page.

Usage

Usage: crazy-deduper [OPTIONS] <SOURCE> <TARGET>

Arguments:
  <SOURCE>
          Source directory

  <TARGET>
          Target directory

Options:
      --cache-file <CACHE_FILE>
          Path to cache file
          
          Can be used multiple times. The files are read in reverse order, so the most accurate ones should come first. The first file given is the one that will be written.

      --hashing-algorithm <HASHING_ALGORITHM>
          Hashing algorithm to use for chunk filenames
          
          [default: sha1]
          [possible values: md5, sha1, sha256, sha512]

      --same-file-system
          Limit file listing to same file system

      --declutter-levels <DECLUTTER_LEVELS>
          Declutter files into this many subdirectory levels
          
          [default: 0]

  -d, --decode
          Invert behavior, restore tree from deduplicated data
          
          [aliases: --hydrate]

  -h, --help
          Print help (see a summary with '-h')

  -V, --version
          Print version

To create a deduplicated version of the directory source in the directory deduped, you can use:

crazy-deduper --declutter-levels 3 --cache-file cache.json.zst source deduped

If the cache file name ends with .zst, the cache will be compressed (or decompressed, in the case of hydrating) with the ZSTD algorithm. For any other extension, plain JSON is used.
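
A minimal sketch of that extension check in Rust, assuming the zstd and serde_json crates; the save_cache helper and the Cache alias are hypothetical stand-ins, not the crate's public API:

use std::collections::HashMap;
use std::fs::File;
use std::io::Write;
use std::path::Path;

// Hypothetical cache representation; the real crate defines its own structures.
type Cache = HashMap<String, String>;

fn save_cache(path: &Path, cache: &Cache) -> std::io::Result<()> {
    let file = File::create(path)?;
    if path.extension().is_some_and(|ext| ext == "zst") {
        // ZSTD-compress the JSON payload; level 0 selects the default level.
        let mut encoder = zstd::stream::write::Encoder::new(file, 0)?;
        serde_json::to_writer(&mut encoder, cache)?;
        encoder.finish()?.flush()?;
    } else {
        // Any other extension: write plain JSON.
        serde_json::to_writer(file, cache)?;
    }
    Ok(())
}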

To restore (hydrate) the directory again into the directory hydrated, you can use:

crazy-deduper --declutter-levels 3 --cache-file cache.json.zst deduped hydrated

Please note that for now you need to specify the same decluttering level as you did when deduping the source directory. This limitation will be lifted in a future version.
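
The reason is that the declutter level determines the subdirectory path of each chunk, so hydrating with a different level would look for chunks in the wrong place. A hypothetical illustration; whether crazy-deduper actually uses one character of the hash per directory level is an assumption:

fn declutter_path(hash: &str, levels: usize) -> String {
    // Build one nesting level per leading hash character, then append
    // the full hash as the file name.
    let mut parts: Vec<&str> = (0..levels).map(|i| &hash[i..i + 1]).collect();
    parts.push(hash);
    parts.join("/")
}

fn main() {
    // With 3 levels, a chunk named "d3b07384" would live at "d/3/b/d3b07384".
    assert_eq!(declutter_path("d3b07384", 3), "d/3/b/d3b07384");
    // With 0 levels (the default), chunks sit flat in the target directory.
    assert_eq!(declutter_path("d3b07384", 0), "d3b07384");
}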

Cache Files

The cache file is necessary to keep track of all file chunks and hashes. Without the cache you would not be able to restore your files.

The cache file can be reused even if the source directory has changed. The cache keeps track of file sizes and modification times, so only new or changed files are re-hashed. Deleted files are removed from the cache.
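
A sketch of this kind of change detection; the CacheEntry type and needs_rehash helper are hypothetical, the real cache structures differ:

use std::fs;
use std::path::Path;
use std::time::SystemTime;

// Hypothetical cache entry holding the metadata used for change detection.
struct CacheEntry {
    size: u64,
    mtime: SystemTime,
}

// Re-hash only when the file is new or its size or mtime changed.
fn needs_rehash(path: &Path, cached: Option<&CacheEntry>) -> std::io::Result<bool> {
    let meta = fs::metadata(path)?;
    Ok(match cached {
        Some(entry) => entry.size != meta.len() || entry.mtime != meta.modified()?,
        None => true, // not in the cache yet
    })
}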

You can also use older cache files in addition to a new one:

crazy-deduper --cache-file cache.json.zst --cache-file cache-from-yesterday.json.zst source deduped

The cache files are read in the reverse of the order in which they are given on the command line, so the content of earlier cache files is preferred over later ones. Hence, you should put your most accurate cache files at the beginning. Moreover, the first cache file given is the one that will be written to; it does not need to exist.

In the given example, if cache.json.zst does not exist, the internal cache is pre-filled from cache-from-yesterday.json.zst so that only new and modified files need to be re-hashed. The result is then written into cache.json.zst.
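
That precedence rule could be sketched like this: the load order is reversed so that earlier files overwrite later ones. The Cache alias is again a hypothetical stand-in for the real cache structures:

use std::collections::HashMap;

// Hypothetical stand-in: path -> cached entry. The real cache also stores
// sizes, modification times and chunk hashes.
type Cache = HashMap<String, String>;

fn merge_caches(paths: &[&str]) -> Cache {
    let mut merged = Cache::new();
    // Iterate in reverse: the last file on the command line is loaded first,
    // so entries from earlier files overwrite it and take precedence.
    for path in paths.iter().rev() {
        if let Ok(data) = std::fs::read_to_string(path) {
            if let Ok(cache) = serde_json::from_str::<Cache>(&data) {
                merged.extend(cache);
            }
        }
    }
    merged
}

fn main() {
    // The first file wins on conflicts and is the one written back afterwards.
    let merged = merge_caches(&["cache.json", "cache-from-yesterday.json"]);
    println!("{} entries", merged.len());
}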

Library

Installation

To add the crazy-deduper library to your project, you can use:

cargo add crazy-deduper

Usage

The following is a short summary of how this library is intended to be used.

Copy Chunks

This is an example of how to re-create the main functionality of the Application.

fn main() {
    // Deduplicate: split everything under "source" into chunks named after
    // their MD5 hashes, using "cache.json.zst" as the persistent cache.
    let mut deduper = crazy_deduper::Deduper::new(
        "source",
        vec!["cache.json.zst"],
        crazy_deduper::HashingAlgorithm::MD5,
        true,
    );
    // Write the chunks into "deduped", decluttered into 3 subdirectory levels.
    deduper.write_chunks("deduped", 3).unwrap();
    deduper.write_cache();

    // Hydrate again: restore the original tree into "hydrated", using the
    // same declutter level as when deduplicating.
    let hydrator = crazy_deduper::Hydrator::new("deduped", vec!["cache.json.zst"]);
    hydrator.restore_files("hydrated", 3);
}

Get File Chunks as an Iterator

This method can be used if you want to implement your own logic and you only need the chunk objects.

fn main() {
    let deduper = crazy_deduper::Deduper::new(
        "source",
        vec!["cache.json.zst"],
        crazy_deduper::HashingAlgorithm::MD5,
        true,
    );

    for (hash, chunk, dirty) in deduper.cache.get_chunks().unwrap() {
        // Chunks and hashes are calculated on the fly, so you don't need to wait for the whole
        // directory tree to be hashed.
        println!("{hash:?}: {chunk:?}");
        if dirty {
            // This is just a simple example. Please do not write after every hash
            // calculation; the IO overhead will slow things down dramatically. Writing
            // every 10 seconds or so is a better compromise. You can also kill the
            // execution at any time: since the cache is written atomically and re-used
            // on subsequent calls, you can terminate and resume at any point.
            deduper.write_cache();
        }
    }
}