Crates.io | bitbottle |
lib.rs | bitbottle |
version | 0.9.1 |
source | src |
created_at | 2021-10-26 05:47:21.727975 |
updated_at | 2022-01-10 01:19:01.89643 |
description | a modern archive file format |
homepage | https://code.lag.net/robey/bitbottle |
repository | https://code.lag.net/robey/bitbottle.git |
id | 471617 |
size | 3,708,239 |
Bitbottle: a modern archive format.
Bitbottle is a data & file format for archiving collections of files & folders, like "tar", "zip", and "winrar". Its primary differentiating features are:
(*) I apologize for the ridiculous names. I did not name any of these algorithms.
cargo install bitbottle
After writing a few drafts in TypeScript going back to 2015, this is a Rust version intended for a wider audience. As of Oct 2021, the basic tools work to build an archive and expand it. The file format is unlikely to change in a backward-incompatible way, though I reserve the right for emergencies until reaching 1.0.
The file format is documented in docs/format.md.
There are a couple of command-line tools for testing so far. All of them respond to --help.
My intention is to make this project useful as a library, not just a set of CLI tools, but the current API is a bit awkward and needs some love before being frozen.
"bitbottle" creates archives from a list of files and folders. To encrypt an archive of the bitbottle source, using an SSH public (test) key, and "snappy" compression:
> ./target/release/bitbottle -v --snappy --pub ./tests/data/test-key.pub -o ./src-test.bb src
Encrypting for robey@togusa (34fd22aae3c59072fd6f48147309eb302ea30f6ae5fc6376f683df3e74485a7c)
drwxrwxr-x robey robey 2021-10-16 16:01:41 src/
-rw-rw-r-- robey robey 12.0K 2021-10-23 12:15:15 src/bottle.rs
-rw-rw-r-- robey robey 9.7K 2021-10-22 16:29:15 src/file_list.rs
[...]
Creating archive: 30 files, 225K bytes
Scanned unique blocks: 30 blocks, 225K bytes
Wrote 85.5K bytes.
"unbottle" can show the contents of an archive:
> ./target/release/unbottle -v --info ./src-test.bb
Bitbottle encrypted with XCHACHA20_POLY1305, 1 public key (ED25519_NACL_SEALED)
Block size: 1.00M
Encrypted for: robey@togusa (34fd22aae3c59072fd6f48147309eb302ea30f6ae5fc6376f683df3e74485a7c)
ERROR: No key or password provided for encrypted bottle
If the bottle is encrypted, you must use a secret key to decrypt it. For ED25519, that means an SSH private key:
> ./target/release/unbottle -v --info --secret ./tests/data/test-key ./src-test.bb
Decrypting with key: robey@togusa
Bitbottle encrypted with XCHACHA20_POLY1305, 1 public key (ED25519_NACL_SEALED)
Block size: 1.00M
Encrypted for: robey@togusa (34fd22aae3c59072fd6f48147309eb302ea30f6ae5fc6376f683df3e74485a7c)
Bitbottle compressed with SNAPPY
drwxrwxr-x robey robey 2021-10-16 16:01:41 src/
-rw-rw-r-- robey robey 12.0K 2021-10-23 12:15:15 src/bottle.rs
-rw-rw-r-- robey robey 9.7K 2021-10-22 16:29:15 src/file_list.rs
[...]
Bitbottle: 30 files, 30 blocks, 225KB -> 85.5KB (BLAKE3 hash)
It will also expand an archive:
> ./target/release/unbottle -v --secret ./tests/data/test-key ./src-test.bb -d /tmp/src-test
Decrypting with key: robey@togusa
Bitbottle encrypted with XCHACHA20_POLY1305, 1 public key (ED25519_NACL_SEALED)
Block size: 1.00M
Encrypted for: robey@togusa (34fd22aae3c59072fd6f48147309eb302ea30f6ae5fc6376f683df3e74485a7c)
Bitbottle compressed with SNAPPY
drwxrwxr-x robey robey 2021-10-16 16:01:41 src/
-rw-rw-r-- robey robey 12.0K 2021-10-23 12:15:15 src/bottle.rs
-rw-rw-r-- robey robey 9.7K 2021-10-22 16:29:15 src/file_list.rs
[...]
Bitbottle: 30 files, 30 blocks, 225KB -> 85.5KB (BLAKE3 hash)
Extracted 30 file(s) (225K bytes) to /tmp/src-test
"buzscan" is a Rust implementation of the buzhash chunking algorithm. It's mostly a demo and test tool for the algorithm used to build a bitbottle archive.
Buzhash is a type of rolling hash which computes a hash over a sliding window of data, rolling forward until it finds one with a specified number of trailing zeros. It breaks the file on these boundaries into roughly even-sized blocks, and emits each block's size and its hash (usually Blake3, but configurable). This can be used by an archiver to identify duplicate blocks. It's good at finding the same hash values inside large files, even after data is moved around.
Some implementations like borg (C source) use a random table or PRNG to map bytes. Buzscan uses a deterministic table built from recursive applications of CRC-32 that were selected to have a good bit distribution.
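To make the rolling-hash idea concrete, here is a minimal sketch of buzhash-style chunking. It is not the buzscan code: the table generator (a fixed-seed xorshift instead of the CRC-32-derived table), the `WINDOW` size, and the names `make_table` and `chunk_lengths` are all assumptions made for illustration.

```rust
/// Hypothetical stand-in table: 256 pseudo-random u32 values from a
/// fixed-seed xorshift generator. (Buzscan's real table is built from
/// CRC-32; this one is only deterministic, not tuned.)
fn make_table() -> [u32; 256] {
    let mut table = [0u32; 256];
    let mut x: u32 = 0x9E37_79B9;
    for entry in table.iter_mut() {
        x ^= x << 13;
        x ^= x >> 17;
        x ^= x << 5;
        *entry = x;
    }
    table
}

/// Size of the sliding window, in bytes (an assumed value).
const WINDOW: usize = 48;

/// Split `data` wherever the rolling hash of the last WINDOW bytes has
/// `bits` trailing zero bits (`bits` must be below 32). Returns the
/// length of each chunk, in order.
fn chunk_lengths(data: &[u8], bits: u32) -> Vec<usize> {
    let table = make_table();
    let mask = (1u32 << bits) - 1;
    let mut lengths = Vec::new();
    let mut hash = 0u32;
    let mut start = 0;
    for i in 0..data.len() {
        // Roll the new byte in.
        hash = hash.rotate_left(1) ^ table[data[i] as usize];
        if i >= start + WINDOW {
            // Roll out the byte that just left the window.
            let old = table[data[i - WINDOW] as usize];
            hash ^= old.rotate_left((WINDOW % 32) as u32);
        }
        // Only cut once a full window has been hashed.
        if i + 1 - start >= WINDOW && (hash & mask) == 0 {
            lengths.push(i + 1 - start);
            start = i + 1;
            hash = 0;
        }
    }
    if start < data.len() {
        lengths.push(data.len() - start);
    }
    lengths
}
```

Because a cut depends only on the bytes inside the window, the same content tends to produce the same boundaries even when it is shifted within a file. With `bits` set to 20, boundaries would occur about once per megabyte on average for random-looking data.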
The "buzscan" CLI tool will traverse a list of files and folders (recursively) and build up a set of blocks, looking for duplicates, and report on the de-duplicated size of the data it found. It's very slow, because it's hashing everything it finds.
> ./target/release/buzscan .
[00:00:01] 935 files, 885 blocks, total disk space: 236M, 154M unique
Some of the modules are apparently not pure-Rust, including argonautica and rust-lzma. They require some local package installs:
(I wish there were native versions of these packages! Please help!)
cargo build --release
./target/release/bitbottle --help
To run the full test suite, which includes some integration tests written in python:
make test
A standard file archive consists of:
That is, the archive itself is a file list. The file list may be compressed, and the compressed data may also be encrypted. Encryption must be the outer-most layer if it is used. The file list is just a count of how many files and blocks are present, followed by a separate bottle for each file and each block.
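The nesting rule can be pictured with a small model. This is only a sketch of the layering described above, not the types used in the crate: the `Bottle` enum, its variants, and `encryption_is_outermost` are hypothetical names.

```rust
/// Hypothetical model of the layers: a file list, optionally wrapped in
/// a compression layer, optionally wrapped in an encryption layer.
#[derive(Debug)]
enum Bottle {
    /// Counts of the file and block bottles that follow.
    FileList { files: usize, blocks: usize },
    Compressed(Box<Bottle>),
    Encrypted(Box<Bottle>),
}

/// True if encryption, when present at all, appears only as the
/// outer-most layer.
fn encryption_is_outermost(bottle: &Bottle) -> bool {
    fn has_encryption(b: &Bottle) -> bool {
        match b {
            Bottle::FileList { .. } => false,
            Bottle::Compressed(inner) => has_encryption(inner),
            Bottle::Encrypted(_) => true,
        }
    }
    match bottle {
        // An outer encryption layer is fine if nothing inside is encrypted.
        Bottle::Encrypted(inner) => !has_encryption(inner),
        other => !has_encryption(other),
    }
}
```

Under this model, an encrypted bottle wrapping a compressed file list passes the check, while a compressed bottle wrapping an encrypted one does not.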
To build an archive, write_archive (in archive.rs) is given a list of starting paths. It scans each path recursively, building up a list of every file to include, then uses buzhash to break each file into blocks of roughly the same size (1MB by default). Each block is identified by its size and hash (Blake3 by default). If we see multiple blocks with the same size and hash, they're duplicates, and we only need to write each block once.
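The de-duplication step can be sketched roughly as below. To stay standard-library only, this uses `DefaultHasher` as a stand-in for Blake3, which is an assumption and not what bitbottle does; `BlockId`, `block_id`, and `scan_unique` are hypothetical names as well.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// A block's identity: its size plus a hash of its contents.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
struct BlockId {
    size: usize,
    hash: u64,
}

/// Stand-in for a Blake3 digest (assumption: std's DefaultHasher).
fn block_id(block: &[u8]) -> BlockId {
    let mut hasher = DefaultHasher::new();
    block.hash(&mut hasher);
    BlockId { size: block.len(), hash: hasher.finish() }
}

/// Walk the scanned blocks and keep each (size, hash) pair only once.
/// Returns the unique ids in first-seen order, the total byte count,
/// and the byte count after de-duplication.
fn scan_unique<'a>(
    blocks: impl IntoIterator<Item = &'a [u8]>,
) -> (Vec<BlockId>, usize, usize) {
    let mut seen = HashSet::new();
    let mut unique = Vec::new();
    let (mut total, mut unique_bytes) = (0, 0);
    for block in blocks {
        let id = block_id(block);
        total += id.size;
        // `insert` returns false when the id was already present.
        if seen.insert(id) {
            unique_bytes += id.size;
            unique.push(id);
        }
    }
    (unique, total, unique_bytes)
}
```

Only the blocks in the returned list would need to be written to the archive. The two byte counts correspond to the kind of "total" versus "unique" figures that buzscan reports.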
Once scanning is complete, we write each file's metadata (its "atlas") as a separate bottle: The header contains its path, permissions, size, and the hash of its overall contents for extra validation. Folders and symlinks are written too, with a size of zero, no hash, and no blocks. For normal files, the bottle stream is a list of the hashes of the blocks that make up its content. (If the file has only one block, we skip this step, since the file's overall hash is also the hash of its only block.) Then we write a separate bottle for each scanned block.
To expand an archive, expand_archive does the opposite: It reads the metadata for each file, and uses the list of block hashes to reassemble the file from each block.
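Reassembly can be sketched in a few lines. This is an in-memory illustration under assumptions, not the expand_archive code: `FileAtlas`, its fields, the 32-byte hash keys, and `reassemble` are hypothetical, and a real implementation would stream blocks rather than hold them all in a map.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory form of one file's metadata ("atlas").
struct FileAtlas {
    path: String,
    size: u64,
    /// Hashes of the blocks that make up the file, in order.
    block_hashes: Vec<[u8; 32]>,
}

/// Rebuild a file's contents by concatenating its blocks in order.
/// Returns None if a block is missing or the final size doesn't match
/// the size recorded in the atlas.
fn reassemble(
    atlas: &FileAtlas,
    blocks: &HashMap<[u8; 32], Vec<u8>>,
) -> Option<Vec<u8>> {
    let mut out = Vec::with_capacity(atlas.size as usize);
    for hash in &atlas.block_hashes {
        out.extend_from_slice(blocks.get(hash)?);
    }
    if out.len() as u64 == atlas.size {
        Some(out)
    } else {
        None
    }
}
```

Because the atlas lists hashes rather than block contents, a block that appears several times in a file, or in several files, is stored once and looked up each time it is needed.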
The low-level format of the bitbottle file and the structure of a bottle is documented in docs/format.md.
For encryption with SSH keys, only Ed25519 keys are currently supported, and only in OpenSSH key files (see the technical description of the OpenSSH key file format).
Apache 2.0 license, included in LICENSE.txt.