# hashdeep-compare
hashdeep-compare is a comparison tool for log files generated by the Hashdeep file storage auditing tool.

### Why use hashdeep-compare? Isn't Hashdeep enough?
Hashdeep can generate a hash digest report for every file in a storage volume or directory, making it suitable for forensic recording, bit-rot detection, or confirming directory and file changes before committing them to backups. Multiple logs of Hashdeep's output can be saved to form a historical record of a storage volume's state. Hashdeep supports the comparison of a log file to a live storage volume, but does not support comparison between log files. hashdeep-compare was created to provide this log comparison capability.

By saving log files and using hashdeep-compare, the contents of storage volumes can be compared regardless of their current availability. This allows retrospective analysis of historical hashdeep logs to compare file and directory states at different times, which supports new use cases such as determining when a file was moved, or confirming that a modified or corrupted file was still intact on a certain date.

If you're concerned about file archive bit-rot or just want to compare archived records of the content of an important directory, using Hashdeep and hashdeep-compare may be a convenient solution.

### How to use hashdeep-compare
hashdeep-compare is a command-line tool with four functions:
* `hash`: invokes hashdeep and generates a log file compatible with hashdeep-compare.
    
    `hashdeep-compare hash path/to/target_dir path/to/output_log.txt`
    
    This function is optional, but recommended to ensure log compatibility. The command above is equivalent to directly calling
    `hashdeep -l -r -o f path/to/target_dir > path/to/output_log.txt 2> path/to/output_log.txt.errors`. Note that if the output file or the error file already exists, the command will be aborted (hashdeep-compare will not overwrite existing files).
    
* `sort`: sorts the entries in a hashdeep log by file path.

    `hashdeep-compare sort path/to/unsorted_input.txt path/to/sorted_output.txt`
    
    hashdeep does not guarantee ordering of log entries, and ordering tends to be inconsistent between runs in practice. Sorting allows comparison of hashdeep logs in a text-diff tool, which may be the easiest way to compare logs with uncomplicated differences. Note that if the output file already exists, the command will be aborted (hashdeep-compare will not overwrite existing files).

* `root`: changes a hashdeep log root by removing a prefix from its filepaths.
    Any entries with filepaths that do not start with the prefix will be
    omitted from the output.

    `hashdeep-compare root path/to/input.txt path/to/output.txt filepath/prefix/`

    This subcommand is an easy way to recover from a hashdeep run that prepended
    unintended parent directories on all of its filepaths because of its invocation
    directory.

    Warning: The prefix is applied as simple text, without any rules related to paths.
    If the prefix "test" were used on the filepath "testdir/file.txt",
    the resulting filepath would be "dir/file.txt".
    Splitting the text of a path component like this probably isn't what you want,
    but there may be some clever uses for it.

    Note that if the output file already exists, the command will be aborted
    (hashdeep-compare will not overwrite existing files).


* `part`: partitions all entries from two log files into sets that efficiently describe the similarities and differences between the logs. This is the real power of hashdeep-compare.

    `hashdeep-compare part path/to/first_log.txt path/to/second_log.txt path/to/output_file_base`
    
    The output file base path will be used to name the output files by adding suffixes that describe the log entries represented within; it may include subdirectories. Nonexistent subdirectories will not be created; if one is specified, the command will be aborted. Note that if any of the resulting output files already exist, the command will be aborted (hashdeep-compare will not overwrite existing files).
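
The plain-text prefix behavior described in the `root` warning above can be illustrated with a minimal Python sketch. The function name `reroot` is hypothetical and this is not hashdeep-compare's own code; it only mirrors the documented behavior on a list of file paths.

```python
def reroot(paths, prefix):
    """Strip a literal text prefix from each path.

    Paths that do not start with the prefix are omitted, as described
    for the `root` subcommand. No path-aware rules are applied.
    """
    return [p[len(prefix):] for p in paths if p.startswith(prefix)]


paths = ["testdir/file.txt", "other/file.txt", "test/a.txt"]

# A prefix without a trailing separator can split a path component.
print(reroot(paths, "test"))   # ['dir/file.txt', '/a.txt']

# Ending the prefix with the separator keeps components whole.
print(reroot(paths, "test/"))  # ['a.txt']
```

In this sketch, ending the prefix with the path separator (as in the `filepath/prefix/` example above) avoids splitting a path component.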

### The partitioning algorithm

When invoked with the recommended settings, Hashdeep creates a one-line log entry for each file that looks something like this:

`3364240,aff470b119f69a7ad5e6999e5e6a3346,bf4fdd9d86cf23e66b456827b5dfe6e2ae52ebc9f32c7de6623aca7b665b3337,./path/example_filename.ext`

This is a comma-separated string of the file's attributes: its size in bytes, the MD5 hash, the SHA256 hash, and the file path. The first three items identify the file's contents (with two separate hash algorithms to protect against hash collisions). If all three are the same for the entries of two different files, hashdeep-compare determines that the files have the same content. If at least one is different, they have different content.
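
As a rough illustration of the entry format described above, the following Python sketch parses an entry line and compares two entries by content. The names `Entry`, `parse_entry`, and `same_content` are hypothetical, and splitting on only the first three commas is an assumption made here so that commas inside a file path are preserved; it is not taken from hashdeep-compare's source.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    size: int
    md5: str
    sha256: str
    path: str


def parse_entry(line: str) -> Entry:
    # Split on the first three commas only, so the path field
    # (the last one) may itself contain commas.
    size, md5, sha256, path = line.rstrip("\n").split(",", 3)
    return Entry(int(size), md5, sha256, path)


def same_content(a: Entry, b: Entry) -> bool:
    # Content is identified by size, MD5, and SHA256 together.
    return (a.size, a.md5, a.sha256) == (b.size, b.md5, b.sha256)


line = ("3364240,aff470b119f69a7ad5e6999e5e6a3346,"
        "bf4fdd9d86cf23e66b456827b5dfe6e2ae52ebc9f32c7de6623aca7b665b3337,"
        "./path/example_filename.ext")
original = parse_entry(line)
renamed = original._replace(path="./other/new_name.ext")

print(original.size)                    # 3364240
print(same_content(original, renamed))  # True
```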

##### Definitions:
* entry: a single line in a Hashdeep log which records a single file from its target volume
* hashes: an entry's file size, MD5, and SHA256 (the first 3 parts of the entry line)
* name: an entry's file path (the last part of the entry line)
* match: a selection of entries matched by the algorithm
* match pair: a match of exactly one entry from each of the two input files
* match group: a match of entries from either or both input files, but not a match pair

The hashdeep-compare partitioning algorithm compares all of the file entries from the two input logs and organizes them based on matching hashes and/or names. 

When the partitioning algorithm starts, all of the entries in both input logs are loaded into a working set. Match rules are applied in a fixed order, and as matches are identified, the matched entries are removed from the working set. When the algorithm finishes, every entry will have been partitioned into exactly one match, or into one of two special sets of unmatched leftover entries.

When input logs 1 and 2 are earlier and later (respectively) records of the same file volume, these match types can imply the type of file change that was made between the times the logs were created.

Match rules, in order, with implied file changes:
1. Full match pairs: unchanged files
1. Full match groups: should never happen (duplicate names imply invalid Hashdeep logs)
1. Name match pairs: modified files
1. Name match groups: should never happen (duplicate names imply invalid Hashdeep logs)
1. Hashes match pairs: moved/renamed files
1. Hashes match groups (entries from both logs): ambiguous rename/move/copy/delete
1. Hashes match groups (entries only from log 1): duplicate files deleted
1. Hashes match groups (entries only from log 2): duplicate files created

After the match rules have been run, no matching names or hashes remain among the leftover entries, which form the two unmatchable sets:
1. unmatchable (entry from log 1): deleted files
1. unmatchable (entry from log 2): created files


Because each log entry is represented in exactly one match or unmatchable set, the algorithm results represent the total content of the two input logs.
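
The procedure above can be sketched in Python as follows. This is a simplified illustration written from the description in this section, not hashdeep-compare's actual implementation; the names `Entry`, `RULES`, and `partition` are hypothetical, and the result keys are chosen here to resemble the output file suffixes.

```python
from collections import defaultdict
from typing import NamedTuple


class Entry(NamedTuple):
    hashes: tuple  # (size, md5, sha256)
    name: str      # file path


# Match rules in the order listed above: full, then name, then hashes.
RULES = [
    ("full", lambda e: (e.hashes, e.name)),
    ("name", lambda e: e.name),
    ("hashes", lambda e: e.hashes),
]


def partition(log1, log2):
    # Working set: every entry, tagged with the log it came from.
    working = [(1, e) for e in log1] + [(2, e) for e in log2]
    results = defaultdict(list)
    for label, key in RULES:
        buckets = defaultdict(list)
        for source, entry in working:
            buckets[key(entry)].append((source, entry))
        working = []
        for members in buckets.values():
            sources = sorted(source for source, _ in members)
            if len(members) == 1:
                working.extend(members)  # still unmatched
            elif sources == [1, 2]:
                results[f"{label}_match_pairs"].append(members)
            else:
                present = set(sources)
                if present == {1}:
                    origin = "file1_only"
                elif present == {2}:
                    origin = "file2_only"
                else:
                    origin = "file1_and_file2"
                results[f"{label}_match_groups_{origin}"].append(members)
    # Whatever is left matched nothing by name or by hashes.
    results["no_match_entries_file1"] = [e for s, e in working if s == 1]
    results["no_match_entries_file2"] = [e for s, e in working if s == 2]
    return dict(results)


A, B, B2, C, D, E = [(n, f"md5-{n}", f"sha-{n}") for n in range(6)]
log1 = [Entry(A, "keep.txt"), Entry(B, "edit.txt"),
        Entry(C, "old/move.txt"), Entry(D, "gone.txt")]
log2 = [Entry(A, "keep.txt"), Entry(B2, "edit.txt"),
        Entry(C, "new/move.txt"), Entry(E, "added.txt")]

result = partition(log1, log2)
print(len(result["full_match_pairs"]))    # 1 (keep.txt, unchanged)
print(len(result["name_match_pairs"]))    # 1 (edit.txt, modified)
print(len(result["hashes_match_pairs"]))  # 1 (move.txt, moved)
print(result["no_match_entries_file1"][0].name)  # gone.txt
print(result["no_match_entries_file2"][0].name)  # added.txt
```

Each entry is removed from the working set as soon as a rule matches it, so every entry ends up in exactly one result set.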

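The results are stored in separate files for each match category, plus two files for unmatchable entries. Match groups are further split by whether their entries come from the first log only, the second log only, or both. These files are created by adding the following suffixes to the output file base parameter supplied to the `part` command: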
* _full_match_pairs
* _full_match_groups_file1_only
* _full_match_groups_file2_only
* _full_match_groups_file1_and_file2
* _name_match_pairs
* _name_match_groups_file1_only
* _name_match_groups_file2_only
* _name_match_groups_file1_and_file2
* _hashes_match_pairs
* _hashes_match_groups_file1_only
* _hashes_match_groups_file2_only
* _hashes_match_groups_file1_and_file2
* _no_match_entries_file1
* _no_match_entries_file2

Because each category is written to its own output file, you can use any text editor to analyze the results, and quickly confirm that any category that should be empty actually is (i.e., has an empty output file).
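
The emptiness check mentioned above could also be scripted. The following Python sketch is one possible approach, not part of hashdeep-compare: the function name `unexpected_outputs` is hypothetical, and it assumes that an "empty" output file is one with a size of zero bytes (a missing file is also treated as empty).

```python
from pathlib import Path

# Suffixes of the categories marked "should never happen" above.
SHOULD_BE_EMPTY = [
    "_full_match_groups_file1_only",
    "_full_match_groups_file2_only",
    "_full_match_groups_file1_and_file2",
    "_name_match_groups_file1_only",
    "_name_match_groups_file2_only",
    "_name_match_groups_file1_and_file2",
]


def unexpected_outputs(output_base):
    """Return names of 'should never happen' files that are not empty."""
    found = []
    for suffix in SHOULD_BE_EMPTY:
        path = Path(str(output_base) + suffix)
        # Assumption: empty means zero bytes; missing counts as empty.
        if path.exists() and path.stat().st_size > 0:
            found.append(path.name)
    return found


# Use the same base path that was passed to the `part` command.
print(unexpected_outputs("path/to/output_file_base"))
```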

### Supplemental: handling of partially-invalid input logs
When reading a hashdeep log, hashdeep-compare performs two content checks:
* In the log header: the line count, hashdeep version, and recorded log format are confirmed. If these are not identical to what the hashdeep-compare test suite uses, a warning is issued. This is intended to warn the user if a different version of hashdeep (or something else) may have generated a log file that might lead to unexpected results.
* Each log entry line is checked for correct formatting: incorrectly-formatted lines are ignored by hashdeep-compare. If any are found, the number of these ignored lines is reported in a warning message.
  
(Note: These checks are here for extra safety. ~~I've never seen hashdeep generate an invalid line~~: if you have one of these, you should probably figure out why before you rely on the output.)

(Update 2024.3.5: hashdeep will generate invalid lines when a filename contains newlines, because the newlines are written into the log as part of the file path. Because of the line-based structure of the hashdeep output format, there may not be an elegant way to add support for these files to hashdeep-compare.)

Regardless of how many warnings are generated, hashdeep-compare will always use all of the correctly-formatted entries to produce the requested output. Warnings, by themselves, will never prevent hashdeep-compare from running to successful completion.
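
The entry-line check and the newline problem described above can be illustrated with a small Python sketch. The pattern `ENTRY_RE` is an assumption based on the entry format shown earlier (size, 32-hex-digit MD5, 64-hex-digit SHA256, path); the exact rules hashdeep-compare applies may differ, and the sketch expects entry lines only, with the log header already removed.

```python
import re

# Assumed shape of a valid entry line: size,md5,sha256,path
ENTRY_RE = re.compile(r"\d+,[0-9a-f]{32},[0-9a-f]{64},.+")


def split_valid(lines):
    """Return (valid_lines, ignored_count) for a list of entry lines."""
    valid, ignored = [], 0
    for line in lines:
        if ENTRY_RE.fullmatch(line):
            valid.append(line)
        else:
            ignored += 1
    return valid, ignored


md5 = "aff470b119f69a7ad5e6999e5e6a3346"
sha = "bf4fdd9d86cf23e66b456827b5dfe6e2ae52ebc9f32c7de6623aca7b665b3337"

# The second file is named "bad\nname.txt": its entry spans two lines.
log_text = f"1,{md5},{sha},./ok.txt\n2,{md5},{sha},./bad\nname.txt\n"

valid, ignored = split_valid(log_text.splitlines())
print(len(valid))  # 2
print(ignored)     # 1
```

In this sketch, the fragment `name.txt` is rejected, while the first half of the broken entry still passes the format check with a truncated path. This is one reason such a warning is worth investigating before relying on the output.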