Crates.io | fsmap |
lib.rs | fsmap |
version | 0.3.0 |
source | src |
created_at | 2024-04-26 20:33:47.775125 |
updated_at | 2024-05-01 10:05:52.460962 |
description | File system mapper and search tool |
repository | https://github.com/berke/fsmap |
id | 1221786 |
size | 72,511 |
fsmap is a small tool that scans a Unix filesystem and creates a serialized listing of the filesystem contents (the "map").
The content includes the directory structure, file sizes, one timestamp (the newest of the creation, access and modification times with a resolution of one minute) and symbolic links.
The map is serialized with serde in MessagePack format (via rmp_serde).
The listing can then be loaded into memory and examined or dumped.
File content digests are not included.
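As an illustration of the timestamp rule described above (newest of three times, one-minute resolution), here is a minimal sketch. It is not taken from the fsmap sources: the function names are hypothetical, the result is assumed to be expressed in whole minutes since the epoch, and `ctime()` (the Unix status-change time) stands in for what the text calls the creation time.

```rust
use std::fs::Metadata;
use std::os::unix::fs::MetadataExt;

/// Newest of three Unix timestamps (seconds since the epoch),
/// truncated to a whole minute and expressed in minutes.
/// Hypothetical helper, not part of fsmap.
fn newest_minute(ctime: i64, atime: i64, mtime: i64) -> i64 {
    let newest = ctime.max(atime).max(mtime);
    // div_euclid rounds toward negative infinity, so times before
    // 1970 are also truncated to the start of their minute.
    newest.div_euclid(60)
}

/// Convenience wrapper that reads the three timestamps from file metadata.
fn entry_timestamp(meta: &Metadata) -> i64 {
    newest_minute(meta.ctime(), meta.atime(), meta.mtime())
}
```

Storing a single minute-resolution value keeps each entry small, at the cost of no longer knowing which of the three times was recorded.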
Indexing works well and search functionality is OK, but memory consumption and load times are a bit high.
Suppose you never got around to setting up that fancy RAID NAS with ZFS and indexing and snapshots. Instead, your drawers contain half a dozen hard disk drives with mostly static contents.
Where did you put that particular file from five years ago?
Yes, you did run find /mnt/my-twelfth-drive >index12, but that gives you no date information and no size information, and grepping it isn't very convenient.
This tool provides a solution.
You create a map for each drive and keep them on your computer, so that you can search their contents without having to plug each drive in.
Searches are also much faster than running e.g. find on a live filesystem, and you can search by date and file size and combine conditions with boolean operators.
To create a map:
fsmap collect --out filesystem.mpk /path/to/my/filesystem
To restrict the map to the first filesystem encountered, add --one-device.
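To make the idea behind a single-device scan concrete, here is a minimal sketch of a recursive walk that does not descend into directories living on another device. This is an assumption about what such an option typically does (similar to find's -xdev), not fsmap's actual scanner; `collect_paths` is a hypothetical name, and the sketch only gathers paths rather than building a map.

```rust
use std::fs;
use std::io;
use std::os::unix::fs::MetadataExt;
use std::path::{Path, PathBuf};

/// Recursively collect every path below `root` (hypothetical helper).
/// When `one_device` is true, directories whose device number differs
/// from that of `root` are listed but not descended into.
fn collect_paths(root: &Path, one_device: bool) -> io::Result<Vec<PathBuf>> {
    // symlink_metadata does not follow symbolic links.
    let root_dev = fs::symlink_metadata(root)?.dev();
    let mut out = Vec::new();
    let mut stack = vec![root.to_path_buf()];
    while let Some(dir) = stack.pop() {
        for entry in fs::read_dir(&dir)? {
            let path = entry?.path();
            let meta = fs::symlink_metadata(&path)?;
            // A symlink to a directory is not a directory here,
            // so links are recorded but never followed.
            let descend = meta.is_dir() && (!one_device || meta.dev() == root_dev);
            if descend {
                stack.push(path.clone());
            }
            out.push(path);
        }
    }
    Ok(out)
}
```

Comparing device numbers against the root is what keeps a scan of one drive from wandering into other mounted filesystems.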
To dump the map (with an optional filtering expression EXPR):
fsmap dump [--expr EXPR] filesystem.mpk
To interactively examine the map, type:
fsmap examine filesystem1.mpk filesystem2.mpk ...
Use ls EXPR to list entries matching EXPR. Type help to get a list of other commands.
Regular expressions are processed using the excellent regex crate.
Examples:
ls mkv$
- Simple regexes do not need to be quoted
ls '\.(mkv|mp4)$'
- Single quotes are needed if certain characters are present
ls usr/share
- Regular expression is matched against full path...
ls %name share
- ...unless %name is used
ls mkv$ & usr/share
- Expressions can be combined using & (and)...
ls '\.qcow2$' | %larger 1G
- ...using | (or)
ls lapack \ %name ^lib
- ...using \ (difference)
ls %after 2014-03-06 & %before 2014-03-09 & reg.*mp4
- Date operators
ls '(?i:\.jpeg$)'
- Case insensitive
quit
- exit
You can ^C in the middle of a listing to get back to the prompt.
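The way the operators &, | and \ combine conditions can be pictured as a small expression tree evaluated against each entry. The sketch below is only an illustration under stated assumptions: `Entry`, `Expr` and `ls` are hypothetical names, plain substring matching stands in for the regex crate, and the size test is assumed to be a strict "greater than". It does not reflect how fsmap parses or evaluates expressions internally.

```rust
/// One map entry, reduced to the fields these examples need (hypothetical).
struct Entry {
    path: String,
    size: u64,
}

/// A tiny expression tree mirroring the operators `&`, `|` and `\`.
enum Expr {
    /// Stand-in for a regex on the full path (substring match here).
    PathContains(String),
    /// Stand-in for a %name regex: matches the last path component only.
    NameContains(String),
    /// Analogue of %larger, with the size given in bytes.
    Larger(u64),
    And(Box<Expr>, Box<Expr>),
    Or(Box<Expr>, Box<Expr>),
    /// Difference: matches the left side but not the right side.
    Diff(Box<Expr>, Box<Expr>),
}

impl Expr {
    fn matches(&self, e: &Entry) -> bool {
        match self {
            Expr::PathContains(s) => e.path.contains(s.as_str()),
            Expr::NameContains(s) => {
                let name = e.path.rsplit('/').next().unwrap_or("");
                name.contains(s.as_str())
            }
            Expr::Larger(n) => e.size > *n,
            Expr::And(a, b) => a.matches(e) && b.matches(e),
            Expr::Or(a, b) => a.matches(e) || b.matches(e),
            Expr::Diff(a, b) => a.matches(e) && !b.matches(e),
        }
    }
}

/// Return the paths of all entries matching `expr`, in input order.
fn ls<'a>(entries: &'a [Entry], expr: &Expr) -> Vec<&'a str> {
    entries
        .iter()
        .filter(|e| expr.matches(e))
        .map(|e| e.path.as_str())
        .collect()
}
```

In this picture, an expression like lapack \ %name lib becomes a Diff node whose left side tests the full path and whose right side tests only the file name.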
Command line history is saved into ~/.fsmap-hist unless the option --no-history is provided.
The map files have no index of any kind (except for the per-device inode maps); fsmap will just gobble up everything and hold it in memory. This can amount to many gigabytes. Search performance is sufficient for my present needs.
The indices are quite large but compress well: in my tests, xz brought them down to about 1/6th of the original size, while other tools gave a reduction of about 2/3rds.
Built-in compression and decompression filters would be helpful.
Scanning performance can be improved significantly by serializing while scanning, and possibly by ditching Serde.
A mmappable index would be great, as load times are significant.
In hindsight, I probably should have stored timestamps for directories.
Also, taking the newest of the three Unix timestamps wasn't that good of an idea, as filesystems are usually mounted with atime.
Feel free to post suggestions on GitHub.
Paged output.
Adding variables to the command language is tempting, but it's also hard to avoid producing yet another crappy programming language.
This tool uses the following important crates:
- regex for regular expressions
- serde and rmp_serde for MPK serialization and deserialization
- rustyline for command-line parsing
To detect ^C I ended up writing the 67 line sigint_detector.rs module, but only after spending a good afternoon trying to read through signal_hook to understand why ^C wasn't working. It wasn't signal_hook's fault, but I realized in horror that it totaled more than 3000 lines, just to catch a signal... come on now, I'm not saying there can't be a use case for gold-plated, cross-platform, thread and data race safe signal handling, but I just want to catch a ctrl-C and be able to figure out what's happening so that I can debug it without getting lost in a maze of two crates and thousands of lines. This is getting ridiculous.
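For readers curious what a small ^C detector can look like, here is a minimal sketch using only the standard library plus two C library functions declared by hand. It is not the sigint_detector.rs module mentioned above, whose contents are not shown here; all names are hypothetical, the value 2 for SIGINT is assumed (true on Linux, macOS and the BSDs), and the `unsafe extern` syntax needs a reasonably recent Rust compiler.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Set by the signal handler, cleared by whoever polls it.
static INTERRUPTED: AtomicBool = AtomicBool::new(false);

/// Assumption: SIGINT is signal number 2 on the target platform.
const SIGINT: i32 = 2;

// Hand-written declarations of two C library functions.
// The Rust standard library already links the C library on Unix.
unsafe extern "C" {
    fn signal(signum: i32, handler: extern "C" fn(i32)) -> usize;
    fn raise(signum: i32) -> i32;
}

/// The handler only stores to an atomic, which is safe in signal context.
extern "C" fn on_sigint(_signum: i32) {
    INTERRUPTED.store(true, Ordering::SeqCst);
}

/// Install the handler; call once at startup.
fn install_sigint_detector() {
    unsafe {
        signal(SIGINT, on_sigint);
    }
}

/// Returns true if a SIGINT arrived since the last call, and resets the flag.
fn take_interrupt() -> bool {
    INTERRUPTED.swap(false, Ordering::SeqCst)
}

/// Send SIGINT to the current process, as the terminal does on ^C.
/// Only useful for demonstrating or testing the detector.
fn simulate_ctrl_c() {
    unsafe {
        raise(SIGINT);
    }
}
```

A long-running listing loop would call take_interrupt() between entries and return to the prompt when it yields true.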
I don't miss the old C days of dealing with malloc() and strlen() and consorts, but I do miss the nice Unix module of OCaml (which probably even works under Windows for most things).
Whatever. I've thrown in Apache and MIT.
Berke DURAK bd@exhrd.fr