dedupefs

version: 0.1.1
created_at: 2025-08-03 23:34:33.281927+00
updated_at: 2025-09-20 17:16:12.036257+00
description: Presents files as deduplicated, content-addressed 1MB chunks with selectable hash algorithms.
repository: https://github.com/FloGa/dedupefs
size: 65,782
author: Florian Gamboeck (FloGa)

README

DedupeFS


Presents files as deduplicated, content-addressed 1MB chunks with selectable hash algorithms.

DedupeFS is a FUSE filesystem built on top of my Crazy Deduper application. It is, so to speak, the logical successor of SCFS. While SCFS presented each file as chunks that were independent of each other, DedupeFS calculates a checksum for each chunk and collects all chunks in a single directory. That way, each unique chunk is presented only once, even if it is used by multiple files.
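
As a purely illustrative sketch of the idea (this is not how DedupeFS works internally; the file and directory names are made up), the following shell snippet splits a file into 1MB chunks and stores each chunk under its content hash, so that identical chunks collapse into a single entry:

# Split myfile into 1MB chunks, then store each chunk under its SHA-256 hash.
# Identical chunks get identical names and are therefore stored only once.
mkdir -p data
split -b 1M myfile chunk.
for part in chunk.*; do
    hash=$(sha256sum "$part" | cut -d ' ' -f 1)
    mv "$part" "data/$hash"
done

DedupeFS achieves the equivalent transparently through FUSE, with a selectable hash algorithm, instead of materializing the chunks on disk.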

DedupeFS is mainly useful for creating efficient backups and uploading them to a cloud provider. The file chunks have the advantage that the upload does not have to be all-or-nothing: if your internet connection drops for a second, your 4GB file upload will not be cancelled completely; only the transfer of the current chunk will be aborted.

By keeping multiple cache files around, you can easily and efficiently have incremental backups that all share the same chunks.

Installation

This tool can be installed easily through Cargo via crates.io:

cargo install --locked dedupefs

Please note that the --locked flag is necessary here to get the exact same dependencies as when the application was tagged and tested. Without it, you might get more up-to-date versions of the dependencies, but you run the risk of undefined and unexpected behavior if a dependency has changed some functionality. The application might even fail to build if the public API of a dependency has changed too much.

Alternatively, pre-built binaries can be downloaded from the GitHub releases page.

Usage

To mount a deduped version of the directory source at the mountpoint deduped, you can use:

dedupefs --cache-file cache.json.zst source deduped

If the cache file name ends in .zst, it will be encoded (or decoded, in the case of hydrating) using the ZSTD compression algorithm. For any other extension, plain JSON is used.
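
For example, to keep the cache as plain, uncompressed JSON, just use a different extension:

dedupefs --cache-file cache.json source deduped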

Please note that it is not yet possible to mount a hydrated version of a deduped directory; this will be added in a future version. In the meantime, you can use Crazy Deduper to physically re-hydrate your files.

Cache Files

The cache file is necessary to keep track of all file chunks and hashes. Without the cache you would not be able to restore your files.

The cache file can be re-used even if the source directory has changed. It keeps track of file sizes and modification times and only re-hashes new or changed files. Files that have been deleted from the source are also removed from the cache.

You can also use older cache files in addition to a new one:

dedupefs --cache-file cache.json.zst --cache-file cache-from-yesterday.json.zst source deduped

The cache files are read in the reverse of the order in which they are given on the command line, so the content of earlier cache files is preferred over that of later ones. Hence, you should put your most accurate cache files at the beginning. Moreover, the first given cache file is the one that will be written to; it does not need to exist.

In the given example, if cache.json.zst does not exist, the internal cache is pre-filled from cache-from-yesterday.json.zst so that only new and modified files need to be re-hashed. The result is then written into cache.json.zst.
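
Put together, a simple incremental backup routine might look like this (the dated file names are only an example):

# Day 1: no cache exists yet, so all files are hashed into cache-monday.json.zst.
dedupefs --cache-file cache-monday.json.zst source deduped

# Day 2: seed the new cache from Monday's cache; only new or changed files are re-hashed.
dedupefs --cache-file cache-tuesday.json.zst --cache-file cache-monday.json.zst source deduped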

In the mounted deduped directory, the first cache file given on the command line is presented under the same name directly beneath the mountpoint, next to the data directory. When uploading your chunks, always make sure to also upload this cache file, otherwise you will not be able to properly re-hydrate your files afterward!
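
To illustrate, the mountpoint then looks roughly like this (the chunk entries below are placeholders; the actual chunk layout depends on the declutter level, see the TODO section):

deduped/
├── cache.json.zst
└── data/
    └── ... content-addressed chunks ...

Since the cache file lives inside the mountpoint, syncing the entire mountpoint captures both the chunks and the cache. For example, assuming you use rclone with a configured remote named remote (any uploader that can sync a directory works):

rclone copy deduped remote:backup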

TODO

  • Support mounting a re-hydrated version of a deduped directory.
  • Make declutter level configurable (fixed to 3 at the moment).
  • Make chunk size configurable (via Crazy Deduper, fixed to 1MB at the moment).
  • Provide better documentation with examples and use case descriptions.