| Field | Value |
|---|---|
| Crates.io | fshasher |
| lib.rs | fshasher |
| version | 0.3.2 |
| source | src |
| created_at | 2024-06-16 00:41:23.882844 |
| updated_at | 2024-07-21 20:13:39.028583 |
| description | Scan the destination folder and make a hash of all files to get the current state of the directory |
| homepage | https://github.com/icsmw/fshasher |
| repository | https://github.com/icsmw/fshasher.git |
| max_upload_size | |
| id | 1273218 |
| size | 230,937 |
# fshasher

`fshasher` allows for quickly calculating a common hash for all files in a target folder (recursively). It performs two primary tasks: collecting the paths of files in the target folder(s) and calculating a summary hash of their contents.

- `fshasher` spawns multiple threads for collecting files and for hashing, resulting in high speed; the actual performance depends on the file system and on the CPU (including the number of cores).
- `fshasher` offers flexible configuration, allowing users to find the best compromise between performance and CPU/file-system load. Different methods for reading files can be chosen based on their sizes (chunk by chunk, complete reading, or memory-mapped files).
- `fshasher` introduces the `Reader` and `Hasher` traits for implementing custom readers and hashers.
- `fshasher` supports filtering of files and folders, allowing the inclusion of only the necessary files in the hash or the exclusion of others. Filtering is based on `glob` patterns.
- `fshasher` performs expensive, long-running operations (collecting and hashing) and allows them to be aborted/cancelled.
- `fshasher` includes an embedded channel for sharing the progress of collecting and hashing.
- `fshasher` supports different levels of error tolerance, enabling the safe skipping of some files (e.g., due to permission issues) while still obtaining the hash of the remaining files.
- With the "tracking" feature, `fshasher` saves information about recent checks and detects changes on each subsequent calculation.

General use cases for `fshasher` include obtaining a snapshot hash of a directory's current state and detecting changes in it between runs. A minimal example:
```rust
use fshasher::{Options, Entry, Tolerance, hasher, reader};
use std::env::temp_dir;

let mut walker = Options::new()
    .entry(Entry::from(temp_dir()).unwrap())
    .unwrap()
    .tolerance(Tolerance::LogErrors)
    .walker()
    .unwrap();
let hash = walker
    .collect()
    .unwrap()
    .hash::<hasher::blake::Blake, reader::buffering::Buffering>()
    .unwrap();
println!("Hash of {}: {:?}", temp_dir().display(), hash);
```
To configure `fshasher`, use the `Options` struct. It provides several useful methods:

- `reading_strategy(ReadingStrategy)` - Sets the reading strategy.
- `threads(usize)` - Sets the number of system threads that the collector and hasher can spawn (the default equals the number of cores).
- `progress(usize)` - Activates progress tracking; the argument defines the capacity of the channel queue.
- `tolerance(Tolerance)` - Sets the tolerance to errors; by default, the collector and hasher do not stop working on errors but report them.
- `path(AsRef<Path>)` - Adds a destination folder to be included in hashing; the folder is included without filtering.
- `entry(Entry)` - Adds a destination folder to be included in hashing; the folder is included with filtering.
- `include(Filter)` - Adds a global positive filter for all entries.
- `exclude(Filter)` - Adds a global negative filter for all entries.
- `storage(AsRef<Path>)` - Available only with the "tracking" feature. Sets the path used to store data about recently calculated hashes.

To set up global filters, which are applied to all entries, use `Options.include(Filter)` and `Options.exclude(Filter)` to define positive and/or negative filters. For filtering, `fshasher` uses `glob` patterns.
The following example:

- includes the entry paths "/music/2023" and "/music/2024";
- includes files with "star" in the name and files with the "flac" extension;
- ignores files located in folders that have "Bieber" in the name.

```rust
let walker = Options::new()
    .path("/music/2023")?
    .path("/music/2024")?
    .include(Filter::Files("*star*"))?
    .include(Filter::Files("*.flac"))?
    .exclude(Filter::Folders("*Bieber*"))?
    .walker(..)?;
```
With `Filter`, a glob pattern is applied to a file's name or a folder's name only, whereas a regular glob pattern is applied to the full path. This allows for more accurate filtering.

- `Filter::Folders(AsRef<str>)` - A glob pattern applied to a folder's name only.
- `Filter::Files(AsRef<str>)` - A glob pattern applied to a file's name only.
- `Filter::Common(AsRef<str>)` - A glob pattern applied to the full path (regular usage of glob patterns).

To create a filter linked to a specific entry, use `Entry`.
The following example defines a separate exclude filter for each entry and global include filters for both:

```rust
let music_2023 = Entry::from("music/2023")?.exclude(Filter::Folders("*Bieber*"))?;
let music_2024 = Entry::from("music/2024")?.exclude(Filter::Folders("*Taylor Swift*"))?;
let walker = Options::new()
    .entry(music_2023)?
    .entry(music_2024)?
    .include(Filter::Files("*star*"))?
    .include(Filter::Files("*.flac"))?
    .walker(..);
```
Note: an exclude `Filter` has priority over an include `Filter`. If an exclude `Filter` matches, the include `Filter` will not be checked.
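For illustration, a minimal sketch of this rule; the path is hypothetical, and it is assumed that `Filter` is exposed at the crate root:

```rust
use fshasher::{Entry, Filter, Options};

// Hypothetical library folder, used only for illustration.
let music = Entry::from("music")?.exclude(Filter::Folders("*Bieber*"))?;
let walker = Options::new()
    .entry(music)?
    // A file such as "music/Bieber/star.flac" matches this include filter,
    // but its parent folder matches the exclude filter above, so the include
    // filter is never checked and the file is skipped.
    .include(Filter::Files("*.flac"))?
    .walker(..)?;
```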
While `Filter` applies a `glob` pattern specifically to the file name or folder name, `PatternFilter` applies a `glob` pattern to the full path (file name including its path), i.e., in the regular way of using `glob` patterns.

- `PatternFilter::Ignore(AsRef<str>)` - If the given glob pattern matches, the path is ignored.
- `PatternFilter::Accept(AsRef<str>)` - If the given glob pattern matches, the path is included.
- `PatternFilter::Cmb(Vec<PatternFilter<AsRef<str>>>)` - Allows defining a combination of `PatternFilter`s. `PatternFilter::Cmb(..)` doesn't support nested combinations; attempting to nest another `PatternFilter::Cmb(..)` inside will cause an error.

The following example accepts "flac" and "mp3" files in each entry while ignoring paths that match a per-entry pattern:
```rust
let music_2023 = Entry::from("music/2023")?
    .pattern(PatternFilter::Accept("*.flac"))?
    .pattern(PatternFilter::Accept("*.mp3"))?
    .pattern(PatternFilter::Ignore("*Bieber*"))?;
let music_2024 = Entry::from("music/2024")?
    .pattern(PatternFilter::Accept("*.flac"))?
    .pattern(PatternFilter::Accept("*.mp3"))?
    .pattern(PatternFilter::Ignore("*Taylor Swift*"))?;
let walker = Options::new()
    .entry(music_2023)?
    .entry(music_2024)?
    .walker(..);
```
Note: `PatternFilter` has higher priority than `Filter`. If a `PatternFilter` has been defined, any `Filter` will be ignored.
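A minimal sketch of this priority rule, assuming `Filter` and `PatternFilter` are exposed at the crate root:

```rust
use fshasher::{Entry, Filter, Options, PatternFilter};

// Because a PatternFilter is defined on this entry, the exclude Filter below is
// ignored: collection is decided by the "*.flac" pattern alone.
let music = Entry::from("music/2023")?
    .pattern(PatternFilter::Accept("*.flac"))?
    .exclude(Filter::Folders("*Bieber*"))?;
let walker = Options::new().entry(music)?.walker(..);
```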
One more variant of `PatternFilter` is `PatternFilter::Cmb(Vec<PatternFilter<AsRef<str>>>)`. You can use it to combine several `PatternFilter`s with an AND condition. The following example collects only "flac" files whose paths do not match the per-entry ignore pattern:
```rust
let music_2023 = Entry::from("music/2023")?.pattern(PatternFilter::Cmb(vec![
    PatternFilter::Accept("*.flac"),
    PatternFilter::Ignore("*Bieber*"),
]))?;
let music_2024 = Entry::from("music/2024")?.pattern(PatternFilter::Cmb(vec![
    PatternFilter::Accept("*.flac"),
    PatternFilter::Ignore("*Taylor Swift*"),
]))?;
let walker = Options::new()
    .entry(music_2023)?
    .entry(music_2024)?
    .walker(..);
```
You can make `fshasher` consider rules from files (like `.gitignore`). `fshasher` will check each folder for a rule file and parse it to extract all glob patterns.
```rust
use fshasher::{Entry, Options, ContextFile};
use std::path::PathBuf;

let mut walker = Options::new()
    .entry(
        Entry::new()
            .entry(PathBuf::from("my/entry/path"))
            .unwrap()
            .context(ContextFile::Ignore(".gitignore")),
    )
    .unwrap()
    .walker()
    .unwrap();
```
- `ContextFile::Ignore` - All rules in the file are used as ignore rules: if a path matches, it is ignored. Ignore rules are applied in the regular way, i.e. to the full path; both folder paths and file paths are checked.
- `ContextFile::Accept` - All rules in the file are used as accept rules: if a path matches, it is accepted; if no rule from the file matches, the file is ignored. Accept rules are applied in a non-regular way, i.e. only to file paths; folder path checks are skipped.

Configuring a reading strategy helps optimize the hashing process to match a specific system's capabilities. On the one hand, the faster a file is read, the sooner its hashing can begin. On the other hand, hashing too much data at once can reduce performance or overload the CPU. To find a balance, the `ReadingStrategy` can be used.
- `ReadingStrategy::Buffer` - Each file is read in the "classic" way using a limited-size buffer, chunk by chunk until the end. The hasher receives small chunks of data to calculate the file's hash. This strategy doesn't load the CPU much, but it entails many IO operations.
- `ReadingStrategy::Complete` - The file is read first, and the complete content is passed to the hasher to calculate the hash. This strategy involves fewer IO operations but loads the CPU more.
- `ReadingStrategy::MemoryMapped` - Instead of reading the file traditionally, this strategy maps the file into memory and provides the full content to the hasher.
- `ReadingStrategy::Scenario(Vec<(Range<u64>, Box<ReadingStrategy>)>)` - Allows combining different strategies based on the file's size.

The following example uses:

- the `ReadingStrategy::MemoryMapped` strategy for files smaller than 1024 KB;
- the `ReadingStrategy::Buffer` strategy for files larger than 1024 KB.

```rust
use fshasher::{collector::Tolerance, hasher, reader, Options, ReadingStrategy};
use std::env::temp_dir;

let mut walker = Options::from(temp_dir())
    .unwrap()
    .reading_strategy(ReadingStrategy::Scenario(vec![
        (0..1024 * 1024, Box::new(ReadingStrategy::MemoryMapped)),
        (1024 * 1024..u64::MAX, Box::new(ReadingStrategy::Buffer)),
    ]))
    .unwrap()
    .tolerance(Tolerance::LogErrors)
    .walker()
    .unwrap();
let hash = walker
    .collect()
    .unwrap()
    .hash::<hasher::blake::Blake, reader::mapping::Mapping>()
    .unwrap()
    .to_vec();
assert!(!hash.is_empty());
```
Note: switching the `ReadingStrategy` rarely yields a noticeable increase in raw performance, but the difference in CPU load can be quite significant.
Out of the box, `fshasher` includes the following readers:

- `reader::buffering::Buffering` - A "classic" reader that reads the file chunk by chunk until the end. It doesn't support mapping the file into memory (cannot be used with `ReadingStrategy::MemoryMapped`).
- `reader::mapping::Mapping` - Supports mapping the file into memory (can be used with `ReadingStrategy::MemoryMapped`) as well as "classic" chunk-by-chunk reading until the end of the file.
- `reader::md::Md` - Instead of reading the file, this reader creates a byte slice from the file's last modification date and its size. This reader is obviously very fast, but it should be used only if you are sure that checking the metadata is enough to draw the right conclusion (see the sketch after this list).
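As a sketch of the metadata-based reader (mirroring the earlier walker setup), keeping in mind that the result reflects only modification times and sizes, not file contents:

```rust
use fshasher::{hasher, reader, Entry, Options, Tolerance};
use std::env::temp_dir;

let mut walker = Options::new()
    .entry(Entry::from(temp_dir()).unwrap())
    .unwrap()
    .tolerance(Tolerance::LogErrors)
    .walker()
    .unwrap();
// reader::md::Md builds the hash from each file's modification date and size,
// so it is very fast but detects only metadata-level changes.
let hash = walker
    .collect()
    .unwrap()
    .hash::<hasher::blake::Blake, reader::md::Md>()
    .unwrap();
println!("Metadata-based hash: {:?}", hash);
```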
`fshasher` includes only one hasher out of the box:

- `hasher::blake::Blake` - A hasher based on the `blake3` crate.

Enabling the `use_sha2` feature allows the use of the following hashers (based on the `sha2` crate):

- `hasher::sha256::Sha256` - More versatile and often used on systems with more limited resources or where compatibility with 32-bit systems is required.
- `hasher::sha512::Sha512` - Preferred for systems with a 64-bit architecture.

```toml
[dependencies]
fshasher = { version = "0.1", features = ["use_sha2"] }
```
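With the `use_sha2` feature enabled, the SHA-2 hashers plug into the same `hash` call; a minimal sketch reusing the walker setup from the first example:

```rust
use fshasher::{hasher, reader, Entry, Options, Tolerance};
use std::env::temp_dir;

let mut walker = Options::new()
    .entry(Entry::from(temp_dir()).unwrap())
    .unwrap()
    .tolerance(Tolerance::LogErrors)
    .walker()
    .unwrap();
// Requires the "use_sha2" feature; hasher::sha512::Sha512 can be used the same way.
let hash = walker
    .collect()
    .unwrap()
    .hash::<hasher::sha256::Sha256, reader::buffering::Buffering>()
    .unwrap();
println!("SHA-256 based hash: {:?}", hash);
```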
A custom hasher can be provided by implementing the `Hasher` trait. Similarly, a custom reader requires an implementation of the `Reader` trait. The built-in `hasher::blake::Blake` and `reader::buffering::Buffering` implementations can serve as reference examples.
With the "tracking" feature, fshasher
will create storage to save information about recently calculated hashes. Using the is_same()
method, it will be possible to detect if any changes have occurred.
Since the data is saved permanently on the disk, the is_same()
method (in the Walker
implementation) will provide accurate information between application runs.
It's strongly recommended to set (using Options
) your own path for fshasher
to save data about recently calculated hashes. If a path isn't set, the default path .fshasher
will be used, which might confuse users of your application.
```rust
use fshasher::{hasher, reader, Entry, Options, Tolerance, Tracking};
use std::env::temp_dir;

let mut walker = Options::new()
    .entry(Entry::from(temp_dir()).unwrap())
    .unwrap()
    .tolerance(Tolerance::LogErrors)
    .walker()
    .unwrap();
// false - because never checked before
println!(
    "First check: {}",
    walker
        .is_same::<hasher::blake::Blake, reader::buffering::Buffering>()
        .unwrap()
);
// true - because checked before
println!(
    "Second check: {}",
    walker
        .is_same::<hasher::blake::Blake, reader::buffering::Buffering>()
        .unwrap()
);
```
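As recommended above, a sketch of pointing the tracking storage at an application-specific location; the path is hypothetical, and it is assumed here that `storage()` returns a `Result` like the other path-taking `Options` builder methods:

```rust
use fshasher::{hasher, reader, Entry, Options, Tolerance, Tracking};
use std::env::temp_dir;

// ".my_app_hashes" is a hypothetical folder name; requires the "tracking" feature.
let mut walker = Options::new()
    .storage(".my_app_hashes")
    .unwrap()
    .entry(Entry::from(temp_dir()).unwrap())
    .unwrap()
    .tolerance(Tolerance::LogErrors)
    .walker()
    .unwrap();
let unchanged = walker
    .is_same::<hasher::blake::Blake, reader::buffering::Buffering>()
    .unwrap();
println!("Unchanged since the last run: {unchanged}");
```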
Hashing a large number of files can be unpredictable in some situations. For example, permission issues can cause errors, or a folder's content might change during the hash calculation. `fshasher` provides control over the tolerance to errors, with the following levels:

- `Tolerance::LogErrors`: Errors are logged, but the collecting and hashing process is not stopped.
- `Tolerance::DoNotLogErrors`: Errors are ignored, and the collecting and hashing process is not stopped.
- `Tolerance::StopOnErrors`: The collecting and hashing process stops on any IO error or any error related to the hasher or reader.

If some files cause permission errors, that is not a "problem" of the file collector, since the collector works in the given context with the given rights. If a user calculates the hash of a folder that includes subfolders without proper permissions, that may well be the user's choice.
Another situation is when the list of collected files changes during hash calculation. In this case, the `hash()` function can still return a hash, and that hash will reflect the changes (for example, if some files have been removed). Meanwhile, the list of files that caused errors will be available in the `Walker`, but each affected `HashItem` will contain an error instead of the file's hash.

Ultimately, whether or not to ignore errors is up to the developer.
`fshasher` uses the `log` crate, a lightweight logging facade for Rust. `log` is used in conjunction with `env_logger`. The following shell command will make some logs visible to you:

```sh
export RUST_LOG=debug
```
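Because `log` is only a facade, the consuming application also needs to initialize a logger; a minimal sketch with `env_logger`:

```rust
fn main() {
    // Picks up RUST_LOG from the environment (e.g. RUST_LOG=debug) and writes
    // fshasher's log records to stderr.
    env_logger::init();

    // ... build Options and calculate hashes here ...
}
```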
Contributions are welcome! Please read the short Contributing Guide.