[![LICENSE](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](LICENSE.txt) [![](https://github.com/icsmw/fshasher/actions/workflows/push_and_pr_master.yml/badge.svg)](https://github.com/icsmw/fshasher/actions/workflows/push_and_pr_master.yml) ![Crates.io](https://img.shields.io/crates/v/fshasher)

`fshasher` quickly calculates a common hash for all files in a target folder (recursively).

# Table of Contents

1. [Introduction](#introduction)
   - [What does it do?](#what-does-it-do)
   - [Features](#features)
   - [Where can it be useful?](#where-can-it-be-useful)
   - [Basic example of usage](#basic-example-of-usage)
2. [Configuration](#configuration)
   - [General](#general)
   - [Filtering](#filtering)
   - [Patterns](#patterns)
   - [Rules in Files](#rules-in-files)
   - [Reading Strategy](#reading-strategy)
3. [Hasher & Reader](#hasher-and-reader)
   - [Default](#default)
   - [Hashers as Features](#hashers-as-features)
   - [Extending](#extending)
4. [Other](#other)
   - [Tracking Changes](#tracking-changes)
5. [Behaviour, Errors, Logs](#behaviour-errors-logs)
   - [Error Handling](#error-handling)
   - [Why Errors Can Be Ignored?](#why-errors-can-be-ignored)
   - [Logs](#logs)

# Introduction

## What does it do?

`fshasher` performs two primary tasks:

- **Collecting**: gathering paths to files from the target folder.
- **Hashing**: calculating a hash for each file and a common hash for all of them.

## Features

- `fshasher` spawns multiple threads for collecting files and hashing them, resulting in high speed; the actual performance depends on the file system and the CPU (including the number of cores).
- `fshasher` offers flexible configuration, allowing users to find the best compromise between performance and CPU/file-system load. Different methods for reading files can be defined based on their sizes (chunk by chunk, complete reading, or memory-mapped files). `fshasher` also introduces the `Reader` and `Hasher` traits for implementing custom readers and hashers.
- `fshasher` supports filtering of files and folders, allowing the inclusion of only necessary files in the hash or the exclusion of others. Filtering is based on `glob` patterns.
- Because collecting and hashing can be expensive, long-running operations, `fshasher` allows them to be aborted/cancelled.
- `fshasher` includes an embedded channel to report the progress of collecting and hashing.
- `fshasher` supports different levels of error tolerance, enabling the safe skipping of some files (e.g., due to permission issues) while still obtaining the hash of the remaining files.
- With the "tracking" feature, `fshasher` saves information about recent checks and detects changes on each subsequent calculation.

## Where can it be useful?

General use cases for `fshasher` include:

- **Build scripts/tasks**: to reduce unnecessary build steps by checking for changes in a folder and deciding whether to perform certain build actions.
- **Tracking changes**: to quickly detect changes in target folders and trigger necessary actions.
- **Other use cases**: any actions that depend on the state of files in a target folder.
## Basic example of usage

```
use fshasher::{Options, Entry, Tolerance, hasher, reader};
use std::env::temp_dir;

let mut walker = Options::new()
    .entry(Entry::from(temp_dir()).unwrap()).unwrap()
    .tolerance(Tolerance::LogErrors)
    .walker().unwrap();
let hash = walker.collect().unwrap()
    .hash::<hasher::blake::Blake, reader::buffering::Buffering>().unwrap();
println!("Hash of {}: {:?}", temp_dir().display(), hash);
```

# Configuration

## General

To configure `fshasher`, use the `Options` struct. It provides several useful methods:

- `reading_strategy(ReadingStrategy)` - Sets the reading strategy.
- `threads(usize)` - Sets the number of system threads that the collector and hasher can spawn (the default value equals the number of cores).
- `progress(usize)` - Activates progress tracking; the argument defines the capacity of the progress channel's queue.
- `tolerance(Tolerance)` - Sets the tolerance to errors; by default, the collector and hasher do not stop working on errors but report them.
- `path(AsRef<Path>)` - Adds a destination folder to be included in hashing; includes the folder without filtering.
- `entry(Entry)` - Adds a destination folder to be included in hashing; includes the folder with filtering.
- `include(Filter)` - Adds a global positive filter for all entries.
- `exclude(Filter)` - Adds a global negative filter for all entries.
- `storage(AsRef<Path>)` - Available only with the "tracking" feature. Sets the path used to store data about recently calculated hashes.

## Filtering

To set up global filters, which will be applied to all entries, use `Options.include(Filter)` and `Options.exclude(Filter)` to define positive and/or negative filters. For filtering, `fshasher` uses `glob` patterns.

The following example:

- Includes entry paths: "/music/2023" and "/music/2024".
- Includes files with "star" in the name and with the "flac" extension.
- Ignores files located in folders that have "Bieber" in the name.

```ignore
let walker = Options::new()
    .path("/music/2023")?
    .path("/music/2024")?
    .include(Filter::Files("*star*"))?
    .include(Filter::Files("*.flac"))?
    .exclude(Filter::Folders("*Bieber*"))?
    .walker(..)?;
```

With `Filter`, a glob pattern can be applied to a file's name or a folder's name only, whereas a regular glob pattern is applied to the full path. This allows for more accurate filtering.

- `Filter::Folders(AsRef<str>)` - A glob pattern that will be applied to a folder's name only.
- `Filter::Files(AsRef<str>)` - A glob pattern that will be applied to a file's name only.
- `Filter::Common(AsRef<str>)` - A glob pattern that will be applied to the full path (regular usage of glob patterns).

To create a filter linked to an entry, use `Entry`. The following example:

- Takes entry paths: "/music/2023" and "/music/2024".
- Includes files that have "star" in the name and have the extension "flac" in both entries.
- Ignores files from "/music/2023" if they are located in folders that have "Bieber" in the name.
- Ignores files from "/music/2024" if they are located in folders that have "Taylor Swift" in the name.

```ignore
let music_2023 = Entry::from("music/2023")?.exclude(Filter::Folders("*Bieber*"))?;
let music_2024 = Entry::from("music/2024")?.exclude(Filter::Folders("*Taylor Swift*"))?;
let walker = Options::new()
    .entry(music_2023)?
    .entry(music_2024)?
    .include(Filter::Files("*star*"))?
    .include(Filter::Files("*.flac"))?
    .walker(..);
```

> **Note**: An exclude `Filter` has priority over an include `Filter`. If an exclude `Filter` matches, the include `Filter` will not be checked, as the sketch below illustrates.
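For instance, in the following sketch (the folder layout in the comments is hypothetical), a file that matches the include filter is still skipped because one of its parent folders matches the exclude filter:

```ignore
// Hypothetical layout (for illustration only):
//   /music/2023/Bieber Remixes/star.flac -> excluded: the folder matches "*Bieber*"
//   /music/2023/Indie/star.flac          -> included: "*.flac" matches and no exclude matches
let walker = Options::new()
    .path("/music/2023")?
    .include(Filter::Files("*.flac"))?
    .exclude(Filter::Folders("*Bieber*"))?
    .walker(..)?;
```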
## Patterns

While `Filter` applies a `glob` pattern specifically to a file's or folder's name, `PatternFilter` applies a `glob` pattern to the full path (file name including path), i.e., the regular way of using `glob` patterns.

- `PatternFilter::Ignore(AsRef<str>)` - If the given glob pattern matches, the path will be ignored.
- `PatternFilter::Accept(AsRef<str>)` - If the given glob pattern matches, the path will be included.
- `PatternFilter::Cmb(Vec<PatternFilter<AsRef<str>>>)` - Allows defining a combination of `PatternFilter`s. `PatternFilter::Cmb(..)` doesn't support nested combinations; attempting to nest another `PatternFilter::Cmb(..)` inside will cause an error.

The following example:

- Takes entry paths: "/music/2023" and "/music/2024".
- Includes files with the extension "flac" **OR** "mp3" in both entries.
- Ignores files from "/music/2023" if the full filename contains "Bieber".
- Ignores files from "/music/2024" if the full filename contains "Taylor Swift".

```ignore
let music_2023 = Entry::from("music/2023")?
    .pattern(PatternFilter::Accept("*.flac"))?
    .pattern(PatternFilter::Accept("*.mp3"))?
    .pattern(PatternFilter::Ignore("*Bieber*"))?;
let music_2024 = Entry::from("music/2024")?
    .pattern(PatternFilter::Accept("*.flac"))?
    .pattern(PatternFilter::Accept("*.mp3"))?
    .pattern(PatternFilter::Ignore("*Taylor Swift*"))?;
let walker = Options::new()
    .entry(music_2023)?
    .entry(music_2024)?
    .walker(..);
```

> **Note**: `PatternFilter` has higher priority than `Filter`. If a `PatternFilter` has been defined, any `Filter` will be ignored.

One more variant of `PatternFilter` is `PatternFilter::Cmb(Vec<PatternFilter<AsRef<str>>>)`. Use it to combine multiple `PatternFilter`s with an `AND` condition. The following example:

- Takes entry paths: "/music/2023" and "/music/2024".
- Collects files from "/music/2023" with the extension "flac" **AND** whose full filename does not contain "Bieber".
- Collects files from "/music/2024" with the extension "flac" **AND** whose full filename does not contain "Taylor Swift".

```ignore
let music_2023 = Entry::from("music/2023")?.pattern(PatternFilter::Cmb(vec![
    PatternFilter::Accept("*.flac"),
    PatternFilter::Ignore("*Bieber*"),
]))?;
let music_2024 = Entry::from("music/2024")?.pattern(PatternFilter::Cmb(vec![
    PatternFilter::Accept("*.flac"),
    PatternFilter::Ignore("*Taylor Swift*"),
]))?;
let walker = Options::new()
    .entry(music_2023)?
    .entry(music_2024)?
    .walker(..);
```

## Rules in Files

You can make `fshasher` consider rules from files (like `.gitignore`). `fshasher` will look for a rule file in each folder and parse it to extract all glob patterns.

```ignore
use fshasher::{Entry, Options, ContextFile};
use std::path::PathBuf;

let mut walker = Options::new()
    .entry(
        Entry::new()
            .entry(PathBuf::from("my/entry/path"))
            .unwrap()
            .context(ContextFile::Ignore(".gitignore")),
    )
    .unwrap()
    .walker();
```

- **`ContextFile::Ignore`** - All rules in the file will be used as ignore rules: if a path matches, it will be ignored. Ignore rules are applied in the regular way, i.e., to the full path: both folder paths and file paths will be checked.
- **`ContextFile::Accept`** - All rules in the file will be used as accept rules: if a path matches, it will be accepted; if no rule from the file matches, the file will be ignored. Accept rules are applied in a non-regular way, i.e., only to file paths; folder path checks will be skipped.
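For accept rules the setup is symmetric. Below is a minimal sketch, assuming a hypothetical rule file named `.hashaccept` (the name is made up for illustration; it stands for any file listing one glob pattern per line):

```ignore
use fshasher::{ContextFile, Entry, Options};

// ".hashaccept" is a hypothetical file name; every glob pattern listed in it
// becomes an accept rule, so only matching files are collected.
let mut walker = Options::new()
    .entry(Entry::from("my/entry/path").unwrap().context(ContextFile::Accept(".hashaccept")))
    .unwrap()
    .walker();
```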
## Reading Strategy

Configuring a reading strategy helps optimize the hashing process for a specific system's capabilities. On the one hand, the faster a file is read, the sooner its hashing can begin. On the other hand, hashing too much data at once can reduce performance or overload the CPU. To find a balance, the `ReadingStrategy` can be used.

- `ReadingStrategy::Buffer` - Each file will be read in the "classic" way using a limited-size buffer, chunk by chunk until the end. The hasher receives small chunks of data to calculate the file's hash. This strategy doesn't load the CPU much, but it entails many IO operations.
- `ReadingStrategy::Complete` - The whole file is read first, and its complete content is passed to the hasher to calculate the hash. This strategy involves fewer IO operations but loads the CPU more.
- `ReadingStrategy::MemoryMapped` - Instead of reading the file traditionally, this strategy maps the file into memory and provides the full content to the hasher.
- `ReadingStrategy::Scenario(Vec<(Range<u64>, Box<ReadingStrategy>)>)` - The scenario strategy allows combining different strategies based on the file's size.

The following example:

- Uses the `ReadingStrategy::MemoryMapped` strategy for files smaller than 1024 KB.
- Uses the `ReadingStrategy::Buffer` strategy for files larger than 1024 KB.

```
use fshasher::{collector::Tolerance, hasher, reader, Options, ReadingStrategy};
use std::env::temp_dir;

let mut walker = Options::from(temp_dir())
    .unwrap()
    .reading_strategy(ReadingStrategy::Scenario(vec![
        (0..1024 * 1024, Box::new(ReadingStrategy::MemoryMapped)),
        (1024 * 1024..u64::MAX, Box::new(ReadingStrategy::Buffer)),
    ]))
    .unwrap()
    .tolerance(Tolerance::LogErrors)
    .walker()
    .unwrap();
let hash = walker.collect()
    .unwrap()
    .hash::<hasher::blake::Blake, reader::mapping::Mapping>()
    .unwrap()
    .to_vec();
assert!(!hash.is_empty());
```

> **Note**: Changing the `ReadingStrategy` rarely gives a large performance gain, but the difference in CPU load can be quite significant.

# Hasher And Reader

## Default

Out of the box, `fshasher` includes the following readers:

- `reader::buffering::Buffering` - A "classic" reader that reads the file chunk by chunk until the end. It doesn't support mapping the file into memory (cannot be used with `ReadingStrategy::MemoryMapped`).
- `reader::mapping::Mapping` - Supports mapping the file into memory (can be used with `ReadingStrategy::MemoryMapped`) as well as "classic" reading chunk by chunk until the end of the file.
- `reader::md::Md` - Instead of reading the file, this reader creates a byte slice from the file's last modification date and its size. Obviously, this reader gives very fast results, but it should be used only if you are sure that checking the metadata is enough to make the right conclusion.

`fshasher` includes only one hasher out of the box:

- `hasher::blake::Blake` - A hasher based on the `blake3` crate.

## Hashers as Features

Enabling the `use_sha2` feature allows the use of the following hashers (based on the `sha2` crate):

- `hasher::sha256::Sha256` - More versatile; often used in systems with more limited resources or where compatibility with 32-bit systems is required.
- `hasher::sha512::Sha512` - Preferred for systems with a 64-bit architecture.

```toml
[dependencies]
fshasher = { version = "0.1", features = ["use_sha2"] }
```

## Extending

Implementing a custom `hasher` can be achieved by implementing the `Hasher` trait.
Similarly, implementing a custom `reader` requires implementing the `Reader` trait.

Here are a couple of examples:

- [Custom Reader](examples/custom_reader)
- [Custom Hasher](examples/custom_hasher)

# Other

## Tracking Changes

With the "tracking" feature, `fshasher` creates storage to save information about recently calculated hashes. Using the `is_same()` method, it is possible to detect whether any changes have occurred. Since the data is saved permanently on disk, the `is_same()` method (in the `Walker` implementation) provides accurate information between application runs.

It's strongly recommended to set (using `Options`) your own path for `fshasher` to save data about recently calculated hashes. If a path isn't set, the default path `.fshasher` will be used, which might confuse users of your application.

```ignore
use fshasher::{hasher, reader, Entry, Options, Tolerance, Tracking};
use std::env::temp_dir;

let mut walker = Options::new()
    .entry(Entry::from(temp_dir()).unwrap())
    .unwrap()
    .tolerance(Tolerance::LogErrors)
    .walker()
    .unwrap();
// false - because it has never been checked before
println!(
    "First check: {}",
    walker
        .is_same::<hasher::blake::Blake, reader::buffering::Buffering>()
        .unwrap()
);
// true - because it has been checked before
println!(
    "Second check: {}",
    walker
        .is_same::<hasher::blake::Blake, reader::buffering::Buffering>()
        .unwrap()
);
```

# Behaviour, Errors, Logs

## Error Handling

Hashing a large number of files can be unpredictable in some situations. For example, permission issues can cause errors, or a folder's content might change during the hash calculation. `fshasher` provides control over the tolerance to errors with the following levels:

- `Tolerance::LogErrors` - Errors will be logged, but the collecting and hashing process will not be stopped.
- `Tolerance::DoNotLogErrors` - Errors will be ignored, and the collecting and hashing process will not be stopped.
- `Tolerance::StopOnErrors` - The collecting and hashing process will stop on any IO error or any error related to the hasher or reader.

## Why Errors Can Be Ignored?

If some files cause permission errors, it isn't a "problem" of the file collector, as the collector works in the given context with the given rights. If a user calculates the hash of a folder that includes subfolders without proper permissions, it might be the user's choice.

Another situation is when the list of collected files changes during hash calculation. In this case, the `hash()` function can still return a hash that reflects the changes in some way (for example, if some file(s) have been removed). Meanwhile, the list of files that caused errors will be available in the `Walker`; for such files, the `HashItem` will contain the error instead of the file's hash.

Ultimately, whether to ignore errors or not is the developer's choice.

## Logs

`fshasher` uses the `log` crate, a lightweight logging facade for Rust. `log` is used in conjunction with `env_logger`. The following shell command will make the logs visible:

```sh
export RUST_LOG=debug
```

# Contributing

Contributions are welcome! Please read the short [Contributing Guide](CONTRIBUTING.md).