> **Historical version.** This spec describes Bao 0.9.1. As of version 0.10,
> Bao is based on the BLAKE3 standard, and much of this document no longer
> applies. Here is the [current spec](spec.md).

# The Bao Spec

## Contents

* [Tree Structure](#tree-structure)
* [Combined Encoding Format](#combined-encoding-format)
* [Outboard Encoding Format](#outboard-encoding-format)
* [Slice Format](#slice-format)
* [Decoder](#decoder)
* [Security](#security)
* [Storage Requirements](#storage-requirements)
* [Performance Notes](#performance-notes)
* [Design Rationales and Open Questions](#design-rationales-and-open-questions)
  + [Why BLAKE2s instead of BLAKE2b?](#why-blake2s-instead-of-blake2b)
  + [Why a 4096-byte chunk size?](#why-a-4096-byte-chunk-size)
  + [Does Bao have a "high security" variant?](#does-bao-have-a-high-security-variant)
  + [Can we expose the BLAKE2 general parameters through the Bao API?](#can-we-expose-the-blake2-general-parameters-through-the-bao-api)
  + [Why does Bao set the node offset parameter for each chunk?](#why-does-bao-set-the-node-offset-parameter-for-each-chunk)
  + [Could we use a simpler tree mode, like KangarooTwelve does?](#could-we-use-a-simpler-tree-mode-like-kangarootwelve-does)
  + [Should we fall back to serial hashing for messages above some maximum size?](#should-we-fall-back-to-serial-hashing-for-messages-above-some-maximum-size)
  + [Is 64 bits large enough for the encoded length?](#is-64-bits-large-enough-for-the-encoded-length)
  + [Could a similar design be based on a different underlying hash function?](#could-a-similar-design-be-based-on-a-different-underlying-hash-function)
  + [Would hashing the length as associated data improve the security of the decoder?](#would-hashing-the-length-as-associated-data-improve-the-security-of-the-decoder)
  + [Would it be more efficient to use a fanout larger than 2?](#would-it-be-more-efficient-to-use-a-fanout-larger-than-2)
  + [Why is the encoding format malleable?](#why-is-the-encoding-format-malleable)
* [Other Related Work](#other-related-work)

## Tree Structure

Bao divides the input up into 4096-byte chunks. The final chunk may be
shorter, but it's never empty unless the input itself is empty. The chunks are
arranged as the leaves of a binary tree. All parent nodes have exactly two
children, and the content of each parent node is the hash of its left child
concatenated with the hash of its right child. When there's an odd number of
nodes in a given level of the tree, the rightmost node is raised to the level
above. Here's what the tree looks like as it grows from 1 to 4 chunks.

```
                                      / \                          / \
                                     /   \                        /   \
                 / \              / \   [CHUNK3]               / \     / \
[CHUNK1]  [CHUNK1] [CHUNK2]  [CHUNK1] [CHUNK2]    [CHUNK1] [CHUNK2] [CHUNK3] [CHUNK4]
```

We can also describe the tree recursively:

- If a tree/subtree contains 4096 input bytes or less, the tree/subtree is
  just a chunk.
- Otherwise, the tree/subtree is rooted at a parent node, and its input bytes
  are divided between its left and right child subtrees. The number of input
  bytes on the left is the largest power of 2 times 4096 that's strictly less
  than the total. The remainder, always at least 1 byte, goes on the right.

Hashing nodes is done with BLAKE2s, using the following parameters:

- **Hash length** is 32.
- **Fanout** is 2.
- **Max depth** is 255.
- **Max leaf length** is 4096.
- **Inner hash length** is 32.
- **Node offset** is incremented for each chunk, 0 for the first chunk, 1 for
  the second chunk, etc. This counter wraps to zero after reaching the maximum
  [BLAKE2X](https://blake2.net/blake2x.pdf) node offset, 2^32 - 1. It's 0 for
  all parent nodes.
- **Node depth** is 0 for all chunks and 1 for all parent nodes.

In addition, the root node -- whether it's a chunk or a parent -- has the
**last node** finalization flag set to true. Note that BLAKE2 supports setting
the last node flag at any time, so hashing the first chunk can begin without
knowing whether it's the root. That root node hash is the output of Bao.
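The recursive definition above translates directly into code. Here's a
minimal hashing sketch in Python, using the tree parameters that the standard
library's `hashlib.blake2s` exposes. It's written for clarity rather than
performance (no parallelism, whole input in memory), and the function and
variable names are illustrative, not part of the spec:

```python
import hashlib

CHUNK_SIZE = 4096

def blake2s_node(data, node_offset, node_depth, is_root):
    # The fixed Bao tree parameters from the list above, plus the per-node
    # offset/depth values and the root's last-node flag.
    return hashlib.blake2s(
        data,
        digest_size=32,
        fanout=2,
        depth=255,
        leaf_size=CHUNK_SIZE,
        node_offset=node_offset,
        node_depth=node_depth,
        inner_size=32,
        last_node=is_root,
    ).digest()

def left_len(content_len):
    # The left subtree gets the largest power of 2 times CHUNK_SIZE that's
    # strictly less than the total.
    power = CHUNK_SIZE
    while power * 2 < content_len:
        power *= 2
    return power

def bao_hash_recurse(buf, chunk_index, is_root):
    # Returns (hash, number of chunks in this subtree).
    if len(buf) <= CHUNK_SIZE:
        # Node offsets wrap at the BLAKE2X maximum.
        return blake2s_node(buf, chunk_index % 2**32, 0, is_root), 1
    llen = left_len(len(buf))
    left, left_chunks = bao_hash_recurse(buf[:llen], chunk_index, False)
    right, right_chunks = bao_hash_recurse(buf[llen:], chunk_index + left_chunks, False)
    return blake2s_node(left + right, 0, 1, is_root), left_chunks + right_chunks

def bao_hash(buf):
    return bao_hash_recurse(buf, 0, True)[0]
```

If this sketch is faithful to the parameters above, `bao_hash(b"\x00" *
8193).hex()` should reproduce the `96e2ab...` root hash from the worked
example below.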
Here's an example tree, with 8193 bytes of input that are all zero:

```
                      root parent hash=96e2ab...
                         <7b34f3...57e13c...>
                        /                     \
     parent hash=7b34f3...                   chunk hash=57e13c...
     <1f889c...48d13f...>                    [\x00]
        /              \
chunk hash=1f889c...   chunk hash=48d13f...
[\x00 * 4096]          [\x00 * 4096]
```

We can verify those values on the command line using the `blake2` utility from
[blake2_simd](https://github.com/oconnor663/blake2_simd):

```bash
# Install the blake2 utility. Make sure your Cargo bin dir is in your $PATH.
$ cargo install blake2_bin

# Define some aliases for hashing nodes.
$ alias hash_node="blake2 -s --fanout=2 --max-depth=255 --max-leaf-length=4096 --inner-hash-length=32"
$ alias hash_chunk="hash_node --node-depth=0"
$ alias hash_parent="hash_node --node-depth=1"

# Hash the first chunk.
$ first_chunk_hash=`head -c 4096 /dev/zero | hash_chunk --node-offset 0`
$ echo $first_chunk_hash
1f889cb45b1901ce01bba35537ede436e5b84e00327eced603a46a9b2b029506

# Hash the second chunk.
$ second_chunk_hash=`head -c 4096 /dev/zero | hash_chunk --node-offset 1`
$ echo $second_chunk_hash
48d13f5d36b8c94c2d7ce8d59bf7053873f5f2cff8fbccd5c239f4fc752b2f88

# Hash the third chunk.
$ third_chunk_hash=`head -c 1 /dev/zero | hash_chunk --node-offset 2`
$ echo $third_chunk_hash
57e13cda44cdd714424d8ca9c1ae37c3c075ee5c872646eb40c5f58a4ee7cc87

# Define an alias for parsing hex.
$ alias unhex='python3 -c "import sys, binascii; sys.stdout.buffer.write(binascii.unhexlify(sys.argv[1]))"'

# Hash the first two chunks' parent node.
$ left_parent_hash=`unhex $first_chunk_hash$second_chunk_hash | hash_parent`
$ echo $left_parent_hash
7b34f3ebe21be2e02acf0da236f5fa5494653fbf465505e783f43b2dbb826885

# Hash the root node, with the last node flag.
$ unhex $left_parent_hash$third_chunk_hash | hash_parent --last-node
96e2ab1a5486faeaecd306cd7fd7eed78bb48d33de4234b4dd019d481e790c4e

# Verify that this matches the Bao hash of the same input.
$ head -c 8193 /dev/zero | bao hash
96e2ab1a5486faeaecd306cd7fd7eed78bb48d33de4234b4dd019d481e790c4e
```

## Combined Encoding Format

The combined encoding file format is the contents of the chunks and parent
nodes of the tree concatenated together in pre-order (that is, a parent,
followed by its left subtree, followed by its right subtree), with the 8-byte
little-endian unsigned input length prepended to the very front. This makes
the order of nodes on disk the same as the order in which a depth-first
traversal would encounter them, so a decoder reading the tree from beginning
to end doesn't need to do any seeking. Here's the same example tree above,
formatted as an encoded file:

```
input length    |root parent node  |left parent node  |first chunk|second chunk|last chunk
0120000000000000|7b34f3...57e13c...|1f889c...48d13f...|000000...  |000000...   |00
```

## Outboard Encoding Format

The outboard encoding format is the same as the combined encoding format,
except that all the chunks are omitted. In outboard mode, whenever the decoder
would read a chunk, it instead reads it from the original input file. This
makes the encoding much smaller than the input, with the downside that the
decoder needs to read from two streams.
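To make the pre-order layout concrete, here's an encoding sketch in the same
style as the hashing sketch above, reusing its `CHUNK_SIZE`, `blake2s_node`,
and `left_len` helpers. An outboard variant would be the same traversal,
collecting only the parent nodes and leaving the chunks out:

```python
def bao_encode(buf):
    # Returns (combined encoding, root hash). Pre-order layout: each parent
    # node (the concatenated hashes of its two children) comes before the
    # encodings of its left and right subtrees.
    def subtree(buf, chunk_index, is_root):
        if len(buf) <= CHUNK_SIZE:
            return blake2s_node(buf, chunk_index % 2**32, 0, is_root), buf
        llen = left_len(len(buf))
        # The left subtree is complete, so it holds exactly llen/CHUNK_SIZE chunks.
        left_hash, left_enc = subtree(buf[:llen], chunk_index, False)
        right_hash, right_enc = subtree(buf[llen:], chunk_index + llen // CHUNK_SIZE, False)
        parent = left_hash + right_hash
        return blake2s_node(parent, 0, 1, is_root), parent + left_enc + right_enc

    root_hash, encoded_tree = subtree(buf, 0, True)
    header = len(buf).to_bytes(8, "little")
    return header + encoded_tree, root_hash
```

For the 8193-byte example, this emits the `0120000000000000` header, then the
root parent, the left parent, and the three chunks, matching the diagram
above.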
## Slice Format

The slice format is very similar to the combined encoding format above. The
difference is that the caller requests a specific start point and byte count,
and chunks and parent nodes that wouldn't be encountered when seeking to that
start point and reading that many bytes are omitted. For example, if we slice
the tree above starting at input byte 4096 (the beginning of the second
chunk), and request any count of bytes less than or equal to 4096 (up to the
end of that chunk), the resulting slice will be this:

```
input length    |root parent node  |left parent node  |second chunk
0120000000000000|7b34f3...57e13c...|1f889c...48d13f...|000000...
```

Although slices can be extracted from both combined and outboard encodings,
there is no such thing as an "outboard slice". Slices always include chunks
inline, as the combined encoding does. A slice that includes the entire input
is exactly the same as the combined encoding of that input.

Decoding a slice works just like decoding a combined encoding. The only
difference is that in cases where the decoder would normally seek forward to
skip over a subtree, the slice decoder keeps reading without a seek, since the
subtree was already skipped by the slice extractor.

A slice always includes at least one chunk, though in the empty encoding case
that chunk is empty. If the requested byte count is zero, that's equivalent to
a byte count of one, such that the chunk containing the start point is
included in the slice. If the requested start point is at or past the end of
the content, the final chunk is included. (See also the "final chunk
requirement" below.) Apart from that, no parents or chunks after the end of
the requested bytes are included. Either the slice extractor or the slice
decoder may return an error if the requested bytes exceed the end of the
content (strict bounds checking), or they may cap the requested bytes
(permissive bounds checking). The reference implementation is permissive.
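Here's a slice-extraction sketch in the same style, with permissive bounds
checking, reusing `CHUNK_SIZE` and `left_len` from the hashing sketch. The
inclusion rule below (overlap with the requested range, plus the final-chunk
special case) is my reading of the rules in this section, so treat it as
illustrative rather than normative:

```python
def bao_slice(encoded, slice_start, slice_len):
    # Walk a combined encoding in pre-order, copying the nodes a slice
    # decoder would need and skipping over everything else.
    content_len = int.from_bytes(encoded[:8], "little")
    # An empty request still includes the chunk containing the start point.
    slice_end = slice_start + max(1, slice_len)
    out = [encoded[:8]]
    pos = 8

    def encoded_subtree_size(clen):
        chunks = max(1, -(-clen // CHUNK_SIZE))  # ceiling division
        return clen + 64 * (chunks - 1)          # chunk bytes plus parent nodes

    def subtree(start, clen):
        nonlocal pos
        end = start + clen
        overlaps = slice_start < end and slice_end > start
        # The final chunk (and the parents above it) is included even when
        # the requested start point is at or past the end of the content.
        include = overlaps or (end >= content_len and slice_start >= end)
        if not include:
            pos += encoded_subtree_size(clen)  # seek past this subtree
            return
        if clen <= CHUNK_SIZE:
            out.append(encoded[pos:pos + clen])
            pos += clen
            return
        out.append(encoded[pos:pos + 64])  # the parent node
        pos += 64
        llen = left_len(clen)
        subtree(start, llen)
        subtree(start + llen, clen - llen)

    subtree(0, content_len)
    return b"".join(out)
```

For the example above, `bao_slice(encoding, 4096, 4096)` emits the header, the
root parent, the left parent, and the second chunk, matching the diagram.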
## Decoder

After parsing the length from the first eight bytes of an encoding, the
decoder traverses the tree by reading parent nodes and chunk nodes. The
decoder verifies the hash of each node as it's read, and it may return the
contents of each valid chunk to the caller immediately. The length itself is
considered validated _when and only when_ the decoder validates the final
chunk, either by validating the entire encoding or by seeking to the end and
validating only the right edge of the tree. The decoder _must not_ expose the
length to the caller in any way before the final chunk is validated. There are
a number of ways the decoder might expose the length, some of which are less
obvious than others:

- An explicit `.length()` method. The reference implementation doesn't include
  one, because it would be required to seek internally. Callers who need the
  length in advance will usually do better to store it separately along with
  the hash.
- Reading the empty encoding. Any read of the empty encoding reports EOF,
  thereby exposing the length (zero). The decoder must verify that the final
  chunk (that is, the empty chunk) matches the root hash. Most implementations
  will naturally satisfy this requirement for non-empty encodings as part of
  reading chunk bytes, but it's easy to forget it in the empty case by
  assuming that you're finished whenever the current position equals the
  content length.
- Seeking past the end. If I seek to an offset greater than or equal to the
  content length, the next read will return EOF, exposing the length. That
  means either the seek itself or the following read must have validated the
  final chunk.
- Seeking relative to the end. Most seek implementations support something
  akin to
  [`SeekFrom::End`](https://doc.rust-lang.org/std/io/enum.SeekFrom.html#variant.End).
  That exposes the length through the absolute offset returned by the seek,
  which means the seek itself must validate the final chunk.

For decoders that don't expose a `.length()` method and don't support seeking,
the security requirements can be simplified to "verify the hash of every node
you read, and don't skip the empty chunk." But decoders that do support
seeking need to consider the final chunk requirement very carefully.

The decoder is expected to maintain these guarantees, even in the face of a
man-in-the-middle attacker who modifies the encoded bytes:

- Any output byte returned by the decoder matches the corresponding input byte
  from the original input.
- If EOF is indicated to the caller in any way, either through a read or
  through a seek, it matches the length of the original input.
- If the decoder reads a complete output to EOF, the decoding hash must be the
  Bao hash of that output. There are no "decoding collisions" where two
  different hashes decode the same output to EOF. (Though two different hashes
  may decode the same output, if at least one of the decodings encounters an
  error before EOF.)

Note one non-guarantee in particular: The encoding itself may be malleable.
Multiple "different" encodings may decode to the same input, under the same
hash. See the design rationales for more on this.
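Here's a decoding sketch for the simplest, all-at-once case, reusing
`CHUNK_SIZE`, `blake2s_node`, and `left_len` from the hashing sketch. Because
it always decodes to EOF, it validates the final chunk (including the empty
chunk of the empty encoding) before returning anything, which satisfies the
final chunk requirement; a streaming or seeking decoder needs more care:

```python
def bao_decode(encoded, root_hash):
    # Decode a complete combined encoding, verifying every node against the
    # hash supplied by the caller. Raises on any mismatch or truncation.
    content_len = int.from_bytes(encoded[:8], "little")
    pos = 8

    def subtree(clen, chunk_index, is_root, expected):
        nonlocal pos
        if clen <= CHUNK_SIZE:
            chunk = encoded[pos:pos + clen]
            pos += clen
            if len(chunk) < clen:
                raise ValueError("unexpected end of encoding")
            if blake2s_node(chunk, chunk_index % 2**32, 0, is_root) != expected:
                raise ValueError("chunk hash mismatch")
            return chunk
        parent = encoded[pos:pos + 64]
        pos += 64
        if blake2s_node(parent, 0, 1, is_root) != expected:
            raise ValueError("parent hash mismatch")
        left_hash, right_hash = parent[:32], parent[32:]
        llen = left_len(clen)
        left = subtree(llen, chunk_index, False, left_hash)
        right = subtree(clen - llen, chunk_index + llen // CHUNK_SIZE, False, right_hash)
        return left + right

    return subtree(content_len, 0, True, root_hash)
```

Note that even the empty encoding verifies its (empty) final chunk here:
`subtree` always hashes the root node before returning.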
## Security

When designing a tree mode, there are pitfalls that can compromise the
security of the underlying hash. For example, if one input produces a tree
with bytes X at the root node, and we choose another input to be those same
bytes X, do those two inputs result in the same root hash? If so, that's a
hash collision, regardless of the security of the underlying hash function. Or
if one input results in a root hash Y, could Y be incorporated as a subtree
hash in another tree without knowing the input that produced it? If so, that's
a length extension, again regardless of the properties of the underlying hash.
There are many possible variants of these problems. [*Sufficient conditions
for sound tree and sequential hashing
modes*](https://eprint.iacr.org/2009/210.pdf) (2009), authored by the
Keccak/SHA-3 team, lays out a minimal set of requirements for a tree mode, to
prevent attacks like the above. This section describes how Bao satisfies those
requirements. They are:

1. **Tree decodability.** The exact definition of this property is fairly
   technical, but the gist of it is that it needs to be impossible to take a
   valid tree, add more child nodes to it somewhere, and wind up with another
   valid tree. That is, it shouldn't be possible to interpret a leaf node as a
   parent node, or to add more children to an existing parent node.
2. **Message completeness.** It needs to be possible to reconstruct the
   original message unambiguously from the tree.
3. **Final-node separability.** Again the exact definition is fairly
   technical, but the gist is that it needs to be impossible for any root node
   and any non-root node to have the same hash.

We ensure **tree decodability** by domain-separating parent nodes from leaf
nodes (chunks) with the **node depth** parameter. BLAKE2's parameters are
functionally similar to the frame bits used in the paper, in that two inputs
with different parameters always produce a different hash, though the
parameters are implemented as tweaks to the IV rather than by concatenating
them with the input. Because chunks are domain-separated from parent nodes,
adding children to a chunk is always invalid. That, coupled with the fact that
parent nodes are always full and never have room for more children, means that
adding nodes to a valid tree is always invalid.

**Message completeness** is of course a basic design requirement of the
encoding format. Concatenating the chunks in order gives the original message.

We ensure **final-node separability** by domain-separating the root node from
the rest of the tree with the **last node flag**. BLAKE2's last node flag is
similar to its other parameters, except that it's an input to the last call to
the compression function rather than a tweak to the IV. In practice, that
allows an implementation to start hashing the first chunk immediately rather
than buffering it, and to set the last node flag at the end if the first chunk
turns out to be the only chunk and therefore the root.

That framework concerns the security of the hash function. To reason about the
security of the decoder, we can start by observing that decoding verifies the
hash of almost every part of the encoded file. If the hash function is
collision resistant, then these parts of the encoding can't be mutated without
leading to a hash mismatch. The one part that isn't directly verified is the
length header. The possibility that the length header might be mutated leads
to the final chunk requirement above. If the length is increased, that's
guaranteed to either lengthen the final chunk, or to replace the final chunk
with another parent node. Likewise if the length is decreased, that's
guaranteed to either shorten the final chunk, or to replace one of the parent
nodes along the right edge of the tree with a chunk. All four of those
scenarios are guaranteed to cause a hash mismatch, because of the collision
resistance of BLAKE2s and because of the domain separation between parents and
chunks, as long as the decoder validates the final chunk. The design
rationales discuss further why we prefer to rely on this final chunk
requirement rather than hashing the length as associated data.

## Storage Requirements

A Bao implementation needs to store one hash (32 bytes) for every level of the
tree. The largest supported input is 2^64 - 1 bytes. Given the 4096-byte chunk
size (2^12), that's 2^52 leaf nodes, or a maximum tree height of 52. Storing
52 hashes, 32 bytes each, requires 1664 bytes, in addition to the [168
bytes](https://blake2.net/blake2.pdf) required by BLAKE2s. For comparison, the
TLS record buffer is 16384 bytes.

Extremely space-constrained implementations that want to use Bao have to
define a more aggressive limit for their maximum input size. In some cases,
such a limit is already provided by the protocol they're implementing. For
example, the largest possible IPv6 "jumbogram" is 4 GiB. Limited to that
maximum input size, Bao's storage overhead would be 20 hashes or 640 bytes.
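The arithmetic generalizes to any input limit. Here's a quick sketch (the
helper name is mine, not part of the spec):

```python
CHUNK_SIZE = 4096
HASH_SIZE = 32

def max_stack_bytes(max_input_size):
    # One stored subtree hash per level of the tree.
    max_chunks = max(1, -(-max_input_size // CHUNK_SIZE))  # ceiling division
    max_height = (max_chunks - 1).bit_length()             # ceil(log2(chunks))
    return max_height * HASH_SIZE

assert max_stack_bytes(2**64 - 1) == 52 * HASH_SIZE  # 1664 bytes
assert max_stack_bytes(4 * 2**30) == 20 * HASH_SIZE  # 640 bytes for jumbograms
```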
## Performance Notes

To get the highest possible throughput, the reference implementation uses both
multithreading and SIMD instructions. Threading exploits the multiple cores
available on modern CPUs, and SIMD allows each thread to hash multiple chunks
in parallel for much higher throughput per thread.

Multithreading in the current implementation is done with the
[`join`](https://docs.rs/rayon/latest/rayon/fn.join.html) function from the
[Rayon](https://github.com/rayon-rs/rayon) library. It splits up its input
recursively -- an excellent fit for traversing a tree -- and allows worker
threads to "steal" the right half of the split, if they happen to be free.
Once the global thread pool is initialized, this approach doesn't require any
heap allocations.

There are two different approaches to using SIMD to speed up BLAKE2. The more
common way is to optimize a single instance, and that's the approach that just
beats SHA-1 in the [BLAKE2b benchmarks](https://blake2.net/). But the more
efficient way, when you have multiple inputs, is to run multiple instances in
parallel on a single thread. Samuel Neves discussed the second approach in [a
2012 paper](https://eprint.iacr.org/2012/275.pdf) and implemented it in the
[reference AVX2 implementation of
BLAKE2sp](https://github.com/sneves/blake2-avx2/blob/master/blake2sp.c). The
overall throughput is more than triple that of a single optimized BLAKE2s
instance. The [`blake2s_simd`](https://github.com/oconnor663/blake2_simd)
implementation includes a
[`hash_many`](https://docs.rs/blake2s_simd/0.5.1/blake2s_simd/many/fn.hash_many.html)
interface, which provides the same speedup for multiple instances of BLAKE2s,
and Bao uses that interface to make each worker thread hash multiple chunks in
parallel. Note that the main downside of BLAKE2sp is that it hardcodes 8-way
parallelism, such that moving to a higher degree of parallelism would change
the output. But `hash_many` doesn't have that limitation, and when AVX-512
becomes more widespread, it will execute 16-way parallelism without changing
the output or the API.
## Design Rationales and Open Questions

### Why BLAKE2s instead of BLAKE2b?

**It's faster, both on small 32-bit embedded systems and on modern 64-bit
systems with SIMD.** There are two important differences between BLAKE2s and
BLAKE2b:

- BLAKE2s operates on 32-bit words, while BLAKE2b operates on 64-bit words.
- BLAKE2s does 10 rounds of compression, while BLAKE2b does 12. This is
  related to their 128-bit vs 256-bit security levels and the larger state
  size of BLAKE2b, which needs more rounds to get good diffusion.

This is similar to the difference between SHA-256 and SHA-512. With both
BLAKE2 and SHA-2, many sources note that the 64-bit variants have better
performance on 64-bit systems. This is true for hashing a single input,
because 64-bit instructions handle twice as much input without being twice as
slow. On modern SIMD hardware, a single instance of BLAKE2b can also take
advantage of 256-bit vector arithmetic, while BLAKE2s can only make use of
128-bit vectors.

However, when the implementation is designed to hash multiple inputs in
parallel, the effect of SIMD is different. In the parallel mode, both BLAKE2b
and BLAKE2s can use vectors of any size, by accumulating words from a
corresponding number of different inputs. Besides being more flexible, this
approach is also substantially more efficient, because the diagonalization
step in the compression function disappears. With 32-bit words and 64-bit
words on a level playing field in terms of SIMD throughput, the remaining
performance difference between the two functions is that BLAKE2s does fewer
rounds, which makes it faster.

That's the story for long inputs, but for short inputs there are more factors
to consider. With just a few bytes of input, both BLAKE2s and BLAKE2b will
compress a single block, and most of that block will be padded with zeros. In
that case it's not average throughput that matters, but the time it takes to
hash a single block. Here BLAKE2s wins again, both because of its smaller
round count and because of its smaller block size.

On the other hand, there's a regime in the middle where BLAKE2b scores some
points. For inputs that are one chunk long, there's nothing to parallelize,
and the higher single-instance throughput of BLAKE2b wins on 64-bit machines.
Also for inputs that are a few chunks long, the lower parallelism degree of
BLAKE2b helps make earlier use of SIMD. For example, a parallel implementation
of BLAKE2b using 256-bit vectors executes 4 hashes at once, so an 8-chunk
input can be efficiently split between 2 threads. But a parallel
implementation of BLAKE2s using 256-bit vectors executes 8 hashes at once, so
the 8-chunk input has no room for a second thread, and a 4-chunk input would
be stuck using 128-bit vectors. In general it takes twice as many chunks for
BLAKE2s to make use of any given SIMD vector width. However, this advantage
for BLAKE2b comes with several caveats:

- It assumes a constant chunk size, but the chunk size is a free parameter in
  the Bao design. If medium-length input parallelism was our biggest concern,
  we could make up the difference by halving the chunk size, probably with
  only a few percentage points of long-length throughput lost.
- Performance differences matter more at the extremes. For extremely long
  inputs, BLAKE2s wins because its lower round count leads to higher overall
  throughput. For extremely limited hardware without 64-bit arithmetic or
  SIMD, BLAKE2s wins because of its 32-bit words.
- It's possible to parallelize Bao across multiple inputs just like we
  parallelize BLAKE2s across multiple inputs. In fact, inputs up to one chunk
  long are just a single BLAKE2s hash, and the parallel BLAKE2s implementation
  could be reused as-is to parallelize the Bao hashes of those short inputs.
  It's unlikely that anyone will go through the trouble of implementing this
  in practice, but an application with a critical bottleneck hashing
  medium-length inputs has this option.

### Why a 4096-byte chunk size?

**Somewhat arbitrary.** Power-of-two chunk sizes lead to simpler arithmetic
and more efficient IO. But as for which power of two to choose, there are
different tradeoffs, and a chunk size of 2048 or 8192 could also have been
reasonable. As noted above, the main advantage of a small chunk size is that
it allows the implementation to parallelize more work for inputs that are only
a few chunks long. On the other hand, the advantage of a large chunk size is
that it reduces the number of parent nodes in the tree and the overhead of
hashing them, which leads to better throughput for large inputs.
Here are [throughput
measurements](https://github.com/oconnor663/bao_experiments/blob/72a13726a1dfb5ed67d41fdea1932a8b0b0dfc0b/benches/libtest.rs#L18-L23)
for hashing a 2^24 byte (16.8 MB) input at different chunk sizes on my laptop
(Intel Core i5-8250U, 4 physical / 8 logical cores, AVX2, with TurboBoost
disabled):

|chunk size (bytes)|throughput (MB/s)|% of max|
|------------------|-----------------|--------|
|65536             |4584             |100     |
|32768             |4571             |99.7    |
|16384             |4563             |99.5    |
|8192              |4513             |98.5    |
|4096              |4439             |96.8    |
|2048              |4122             |89.9    |
|1024              |3782             |82.5    |
|512               |3272             |71.4    |
|256               |2596             |56.6    |

In these measurements, increasing the chunk size has a big impact on
throughput up to 4096 bytes. Going from 2048 bytes to 4096 bytes raises
throughput by about 7%. But further increases in the chunk size are less
impactful, below 2%.

Another consideration might be how much buffer space a streaming
implementation needs to allocate to take full advantage of SIMD. The widest
SIMD instruction set available on x86 today is AVX-512, which can run 16
BLAKE2s hashes in parallel. With a chunk size of 4096 bytes, a 16-chunk buffer
is 64 KiB, which is already comparable to e.g. the [default maximum thread
stack size under musl
libc](https://wiki.musl-libc.org/functional-differences-from-glibc.html#Thread_stack_size).
That's another small motivation not to use chunks larger than 4096 bytes.

### Does Bao have a "high security" variant?

**No.** A 256-bit digest with its 128-bit security level is enough for
practically any cryptographic application, which is why everyone uses SHA-256
for TLS certificates and why the Intel SHA Extensions don't include SHA-512.
Higher security levels waste cycles, and longer digests waste bandwidth. Also,
having multiple variants of the same algorithm complicates implementations and
confuses people.

### Can we expose the BLAKE2 general parameters through the Bao API?

**Yes, though there are some design choices we need to make.** The general
parameters are the variable output length, secret key, salt, and
personalization string. A future version of this spec will almost certainly
settle on a way to expose them. The salt and personalization will probably be
simple; just apply them to all nodes in the tree. The right approach for the
secret key is less clear. The BLAKE2 spec says: "Note that tree hashing may be
keyed, in which case leaf instances hash the key followed by a number of bytes
equal to (at most) the maximal leaf length." That remark actually leaves the
trickiest detail unsaid, which is that while only the leaf nodes hash the key
bytes, _all_ nodes include the key length as associated data. This behavior is
visible in the BLAKE2sp [reference
implementation](https://github.com/BLAKE2/BLAKE2/blob/a90684ab3fe788b2ca45076cf9b38335de289f58/ref/blake2sp-ref.c#L64)
and confirmed by its test vectors. Unfortunately, this behavior is impossible
to implement with Python's `hashlib.blake2s` API, which ties the key length
and key bytes together. An approach that applied the key bytes to every node
would fit into Python's API, but it would both depart from the spec
conventions and add extra overhead. Implementing Bao in pure Python isn't
necessarily a requirement, but the majority of BLAKE2 implementations in the
wild have similar limitations.

The variable output length has a similar issue. In BLAKE2sp, the root node's
output length is the hash length parameter, and the leaf nodes' output length
is the inner hash length parameter, with those two parameters set the same way
for all nodes.
That's again impossible in the Python API, where the output length and the
hash length parameter are always set together. Bao has the same problem,
because the interior subtree hashes are always 32 bytes.

### Why does Bao set the node offset parameter for each chunk?

**To discourage implementers from making non-constant-time optimizations.**
One of the important security properties of a cryptographic hash is that it
runs in the same amount of time for any two inputs of the same size. Otherwise
it would leak information about its input through the timing side-channel.
BLAKE2s was designed to be constant time, so a straightforward implementation
of Bao is also constant time by default. However, consider the following
dangerous optimization:

> For each chunk, before hashing it, check to see whether it's all zero. If
> so, skip the work of hashing it, and use a hardcoded hash of the zero chunk.

For inputs that are known to contain mostly zeros, this optimization could
lead to a huge speedup, by letting Bao skip almost all of its work. But it
would violate the constant time property and leak lots of information about
the structure of its input. This sort of mistake is what the node offset
parameter is preventing. Because each chunk uses a different node offset,
optimizations like this one don't work, and so implementations will tend to
preserve constant time execution (a ["pit of
success"](https://blog.codinghorror.com/falling-into-the-pit-of-success/)).
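As a quick illustration with Python's `hashlib`, the same all-zero chunk
hashes differently at different offsets, so there's no single zero-chunk hash
to hardcode (the helper name is mine):

```python
import hashlib

def chunk_hash(chunk, offset):
    # Bao's chunk parameters, varying only the node offset.
    return hashlib.blake2s(chunk, fanout=2, depth=255, leaf_size=4096,
                           inner_size=32, node_offset=offset).hexdigest()

zero_chunk = b"\x00" * 4096
assert chunk_hash(zero_chunk, 0) != chunk_hash(zero_chunk, 1)
```

With offsets 0 and 1, this should reproduce the `1f889c...` and `48d13f...`
chunk hashes from the worked example above.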
The maximum node offset in [BLAKE2X](https://blake2.net/blake2x.pdf) (reduced
from the original BLAKE2 spec) is 2^32 - 1, which is smaller than Bao's
maximum input length of 2^64 - 1 bytes or 2^52 chunks. That means that node
offsets will repeat given a very large input (16 TiB). However, note that the
Security section above doesn't rely on the node offset parameter. The
implementation details that prevent collisions and length extensions under the
[*Sufficient conditions*](https://eprint.iacr.org/2009/210.pdf) framework are
the node depth parameter and the last node flag, which domain separate
chunk/parent nodes and root/non-root nodes respectively. Since we don't rely
on the node offset for security in general, its range only needs to be large
enough to discourage the optimizations we want to discourage. And 16 TiB is
large enough.

By not setting the node offset for parent nodes, and by using only 0 and 1 for
the node depth parameter, Bao breaks the conventions of the [BLAKE2
spec](https://blake2.net/blake2.pdf). Some breakage was necessary, though,
because we use the last node flag as a root distinguisher, while the spec uses
it on every level of the tree. I believe this is a mistake in the original
spec, because distinguishing the root is necessary to prevent generalized
length extensions. The spec's root distinguisher is the combination of the
last node flag and the zero value of the node offset. However, as noted above,
the node offset might wrap for very large inputs. If you have a tree hash
based on BLAKE2 according to the conventions of the spec, and the node offset
is defined to wrap, you can "extend" that hash by prepending 16 TiB of
arbitrary data (or less if the chunk size is smaller). That's not a very
practical attack, and the spec could mitigate it by requiring that the node
offset saturate instead of wrapping, but ultimately it's better to prevent it
by using the last node flag exclusively for the root. Apart from that, setting
the node offset and node depth for parent nodes as in the spec would require
some extra math to tell which subtrees are being merged in the subtree stack,
which we don't currently need to know. Since there's not much value in
sticking closer-but-still-not-fully to the original conventions, the Bao
design prioritizes simplifying the implementation.

### Could we use a simpler tree mode, like KangarooTwelve does?

**No, the encoding format requires a full tree.**
[KangarooTwelve](https://keccak.team/kangarootwelve.html) is a modern hash
function based on Keccak/SHA3, and it includes a ["leaves stapled to a
pole"](https://www.cryptologie.net/article/393/kangarootwelve) tree internally
to allow for parallelism in the implementation. This is much simpler to
implement than a full binary tree, and it adds less storage overhead. However,
a shallow tree would limit the usefulness of Bao's encoding and slicing
features. The root node becomes linear in the size of the input. Encoding a
gigabyte file, for example, would produce a root node that's several
megabytes. The recipient would need to fetch and buffer the entire root before
verifying any content bytes, and decoding would require heap allocation. The
usefulness of the encoding format would be limited to the space of files big
enough that streaming is valuable, but small enough that the root node is
manageable, and it would preclude most embedded applications. Incremental
update schemes would suffer too, because every update would need to rehash the
large root node.

A two-level tree would also limit parallelism. As noted in the [KangarooTwelve
paper](https://eprint.iacr.org/2016/770.pdf), given enough worker threads
hashing input chunks and adding their hashes to the root, the thread
responsible for hashing the root eventually reaches its own throughput limit.
If the root hash has the same throughput as the leaves, this happens at a
parallelism degree equal to the ratio of the chunk size and the hash length,
256 in the case of KangarooTwelve. In practice, devoting an entire core to a
single instance roughly doubles that one instance's throughput, bringing the
max degree to 512. That sounds like an impossibly high number, but consider
that one of Bao's benchmarks runs on a 48-physical-core AWS "m5.24xlarge"
machine, and that the AVX-512 version of BLAKE2s ([preliminary
benchmarks](https://github.com/oconnor663/blake2s_simd/issues/1#issuecomment-484572123))
will hash 16 chunks in parallel per thread. That machine supports parallelism
degree 768 today.

A two-level tree also creates a specific problem for multithreading.
Multithreaded parallelism is "coarse-grained", in that it works best when each
thread can execute a long-running task without needing to communicate, because
communication between threads is relatively expensive. For example, with two
threads working on a binary tree, we can split the output approximately in
half, with each thread computing a single subtree hash over its half without
communicating at all with the other. This doesn't work as well with a
two-level tree, though, because each thread (or at least each thread besides
the first) would need to return a long list of leaf hashes, which would mean
allocating memory. To avoid storing leaf hashes, the threads could instead
work on alternating even and odd leaves, both starting at the beginning of the
input.
That would allow each leaf hash to be incorporated into the root immediately,
at the cost of communicating between threads after each leaf, and possibly
some smaller unavoidable heap allocations. That cost can be amortized over a
large leaf size, or over a larger group of leaves, down to a level that's
acceptable on a single machine. But when the cost of communication is even
higher, like in a MapReduce job or some other distributed setting, it becomes
a problem again.

### Should we fall back to serial hashing for messages above some maximum size?

**No.** Many tree modes, including some described in the [BLAKE2
spec](https://blake2.net/blake2.pdf), fall back to a serial mode after some
threshold. That allows them to specify a small maximum tree height for reduced
memory requirements. However, one of Bao's main benefits is parallelism over
huge files, and falling back to serial hashing would conflict with that. It
would also complicate encoding and decoding.

### Is 64 bits large enough for the encoded length?

**Yes.** Every filesystem in use today has a maximum file size of 2^64 bytes
or less. Bao's decoding features are designed to work with the IO interfaces
of mainstream programming languages, particularly around streaming and
seeking. These interfaces are [usually
restricted](https://doc.rust-lang.org/std/io/enum.SeekFrom.html) to 64-bit
sizes and offsets. If Bao supported longer streams in theory, implementations
would need to handle more unrepresentable edge cases. (Though even with a
64-bit counter, the maximum _encoded_ file size can exceed 64 bits, and a
perfect decoder implementation would need to seek twice to reach bytes near
the end of max-size encodings. In practice the reference implementation
returns an overflow error in that unlikely case.)

Implementations also need to decide how much storage overhead is reasonable.
If the counter were larger, it would still make almost no sense to allocate
space for a larger tree. The recommendation would probably be to assume a
maximum stack depth of 52 as we do now, but it would put the burden of choice
on each implementation.
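The parenthetical above is easy to check: parent nodes add 64 bytes per chunk
boundary, so the combined encoding of a maximum-length input overflows a
64-bit offset. A quick sketch (helper names are mine):

```python
CHUNK_SIZE = 4096
PARENT_SIZE = 64  # two 32-byte child hashes
HEADER_SIZE = 8

def combined_encoded_size(content_len):
    chunks = max(1, -(-content_len // CHUNK_SIZE))  # ceiling division
    return HEADER_SIZE + content_len + PARENT_SIZE * (chunks - 1)

# The encoding of a maximum-length input doesn't fit in a 64-bit offset.
assert combined_encoded_size(2**64 - 1) >= 2**64
```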
### Could a similar design be based on a different underlying hash function?

**Yes, as long as the underlying hash prevents length extension.** SHA-256 and
SHA-512 aren't suitable, but SHA-512/256 and SHA-3 could be. Domain separation
between the root and non-root nodes, and between chunks and parent nodes, is a
security requirement. For hash functions without associated data parameters,
you can achieve domain separation with a small amount of overhead by appending
some bits to every node. See for example the [Sakura
coding](https://keccak.team/files/Sakura.pdf), also designed by the
Keccak/SHA-3 team. Note that the chunk/parent distinguisher may be an
initialization parameter (as `node_depth` is), but the root/non-root
distinguisher needs to be a finalization parameter (as `last_node` is) or an
input suffix. Making the root/non-root distinguisher part of initialization
would force the implementation to either buffer the first chunk or to hash it
both ways.

As noted above, there's no "high security" variant of Bao. However, in some
future world with large quantum computers, it could theoretically make sense
to define a new hash function targeting a 256-bit security level. We could
achieve that by replacing BLAKE2s with BLAKE2b with very few other changes.

### Would hashing the length as associated data improve the security of the decoder?

**No, not for a correct implementation.** An attacker modifying the length
bytes can't produce any observable result, other than the errors that are also
possible by modifying or truncating the rest of the encoding. The story is
more complicated if we consider "sloppy" implementations that accept some
invalid encodings, in which case hashing the length could mitigate some
attacks but would also create some new ones. An earlier version of the Bao
design did hash the length bytes, but the current design doesn't.

Let's start by considering a correct decoder. What happens if a
man-in-the-middle attacker tweaks the length header in between encoding and
decoding? Small tweaks change the expected length of the final chunk. For
example, if you subtract one from the length, the final chunk might go from 10
bytes to 9 bytes. Assuming the collision resistance of BLAKE2, the 9-byte
chunk will necessarily have a different hash, and validating it will lead to a
hash mismatch error. Adding one to the length would be similar. Either no
additional bytes are available at the end to supply the read, leading to an IO
error, or an extra byte is available somehow, leading to a hash mismatch.
Larger tweaks have a bigger effect on the expected structure of the tree.
Growing the tree leads to chunk hashes being reinterpreted as parent hashes,
and shrinking the tree leads to parent hashes being reinterpreted as chunk
hashes. Since chunks and parents are domain separated from each other, this
also leads to hash mismatch errors in the tree, in particular always including
some node along the right edge.

Those observations are the reason behind the "final chunk requirement" for
decoders. That is, a decoder must always validate the final chunk before
exposing the length to the caller in any way. Because an attacker tweaking the
length will always invalidate the final chunk, this guarantees that the
modified length value will never be observed outside of the decoder. Length
tweaks might or might not invalidate earlier chunks before the final one, and
decoding some chunks might succeed before the caller eventually hits an error,
but that's indistinguishable from simple corruption or truncation at the same
point in the tree.

So, what happens if the decoder gets sloppy? Of course the implementation
could neglect to check any hashes at all, providing no security. Assuming the
implementation does check hashes, there are a couple of other subtle mistakes
that can still come up in practice (insofar as I made them myself in early
versions of the reference implementation). The first one, we just mentioned:
failure to uphold the "final chunk requirement". As a special case, recall
that the empty tree consists of a single empty chunk; if the decoder neglects
to validate that empty chunk and skips right to its EOF state, then the empty
encoding wouldn't actually validate anything at all, making it appear valid
under _any_ root hash. More generally, if the decoder seeks past EOF or
relative to EOF without validating the final chunk first, an attacker could
effectively truncate encodings without detection by shortening the length, or
change the target offset of EOF-relative seeks.

The other likely mistake is "short reads". The IO interfaces in most languages
are based on a `read` function which _usually_ returns as many bytes as you
ask it to but which _may_ return fewer for any reason. Implementations need to
either call such functions in a loop until they get the bytes they need, or
use some higher level wrapper that does the same.
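Here's what such a loop looks like in Python, a minimal sketch of the wrapper
a careful decoder would use for every parent and chunk read (the helper name
is mine):

```python
def read_exact(stream, n):
    # A single stream.read(n) may legally return fewer than n bytes, so keep
    # reading until we have them all, or fail loudly at EOF.
    buf = bytearray()
    while len(buf) < n:
        more = stream.read(n - len(buf))
        if not more:
            raise EOFError("unexpected end of file")
        buf.extend(more)
    return bytes(buf)
```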
Implementations that neglect to call `read` in a loop will often appear to
work in tests, but will be prone to spurious failures in less common IO
settings like reading from a pipe or a socket. This mistake also opens up a
class of attacks. An attacker might modify the length header of an encoding,
for example creating an encoding with 9 content bytes but a length header of
10. In this case, a correct decoder would fail with an unexpected-end-of-file
error, but an incorrect decoder might "short read" just the 9 bytes without
noticing the discrepancy and then successfully validate them. That exposes the
caller to inconsistencies that the attacker can control: The length of all the
bytes read (9) doesn't match the offset returned by seeking to EOF (10), and
like the "final chunk requirement" issue above, an attacker can again change
the target offset of EOF-relative seeks.

With those two classes of attacks in mind, we can come back to the original
question: Would hashing the length as associated data (probably as a suffix to
the root node) mitigate any of the attacks above for sloppy implementations?
The answer is some yes and some no. The most egregious "final chunk
requirement" vulnerability above -- validating nothing at all in the case of
an empty encoding -- remains a pitfall regardless of associated data.
Seek-past-EOF also remains a pitfall but in a slightly modified form: the
implementation might detect the modified length if it validates the root node
before seeking, but forgetting to validate the root node would be the same
class of mistake as forgetting to validate the final chunk. However, the
decoder would do better in any scenario where you actually tried to read chunk
data; getting to a chunk means validating the root node on your way down the
tree, and bad associated data would fail validation at that point.

The "short reads" vulnerabilities above would also be partially mitigated. In
a scenario where the attacker corrupts a "legitimate" encoding by changing its
length header after the fact, hashing the length as associated data would
detect the corruption and prevent the attack. But in a scenario where the
attacker constructs both the encoding and the hash used to decode it, the
attacker may craft an "illegitimate" root hash that _expects_ an inconsistent
length header. (A typical encoder doesn't expose any way for the caller to put
an arbitrary value in the length header, but there's nothing stopping an
attacker from doing it.) In this case the inconsistent length pitfall would
remain: the root node would validate with the bad length as associated data,
the final chunk would validate with a short read, and once again the length of
all the bytes read wouldn't match the offset returned by seeking to EOF.

If that were the whole story -- that hashing the length as associated data
mitigated some attacks on sloppy implementations -- that would be some
motivation to do it. However, that last scenario above actually leads to a new
class of attacks, by violating Bao's "no decoding collisions" guarantee. That
is, no input should ever decode (successfully, to completion) under more than
one hash. (And naturally the one hash an input does decode under must be that
input's own Bao hash.) The illegitimate encoding above and its exotic root
hash constitute a "decoding collision" with the legitimate encoding they're
derived from.
To put yourself in the shoes of a caller who might care about this property,
imagine you have a dictionary containing Bao encodings indexed by the hashes
that decode them. If you find that a given string's hash isn't present as a
key in your dictionary, is it safe for you to assume that no encoding in your
dictionary will decode to that string? Bao says yes, you may assume that. And
maybe equally importantly, callers in that scenario _will_ assume that without
bothering to read the spec. In this sense, including the length as associated
data would actually make sloppy implementations _less_ secure, by giving
attackers a way to create decoding collisions.

Earlier versions of Bao did append the length to the root node, instead of
using a chunk/parent distinguisher. A proper distinguisher (the `node_depth`
initialization parameter) was added later, both to better fit the [*Sufficient
conditions*](https://eprint.iacr.org/2009/210.pdf) framework and to avoid
issues around integer overflow. At that point the length suffix was redundant,
and it also incurred some performance overhead in the short message case,
where a one-block message would require two blocks of compression. It was
dropped mainly for that performance reason, since the sloppy implementation
concerns above aren't decisive either way.

### Would it be more efficient to use a fanout larger than 2?

**Only a little, not enough to justify the added complexity.** Parent nodes in
the binary tree exactly fit a BLAKE2s input block, so there's no free
throughput to be had by making the parents larger. Rather, a larger fanout
could help by reducing the number of levels in the tree. At most (with
unlimited fanout, like KangarooTwelve) the total number of parent bytes hashed
can be reduced by 50%. However, the parent bytes are already a small fraction
of the input bytes by design, so the potential gain here is small. In my
[throughput
measurements](https://github.com/oconnor663/bao_experiments/blob/72a13726a1dfb5ed67d41fdea1932a8b0b0dfc0b/benches/libtest.rs#L32-L37),
the largest efficiency gain seems to come at fanout 8, where the throughput
for 4096-byte blocks is about the same as it is with fanout 2 and 8192-byte
blocks. As noted above, this is a small difference, less than 2%. At the same
time, using a fanout greater than 2 brings in a lot of new complexity:

- Parent nodes along the right edge of the tree may have fewer children,
  meaning that parent nodes are of variable size.
- The implementation needs to do more bookkeeping to maintain enough separate
  inputs for good SIMD performance. Consider a 4-ary tree using a SIMD degree
  of 8 and 9 chunks of input. The chunks need to be grouped 4-4-1 to fit the
  4-ary tree structure, but SIMD performance is better if 8 chunks get hashed
  together.
- There's more ambiguity in the tree layout. For example with 6 chunks and
  fanout 4, you could plausibly lay out the tree in the following two ways.

```
       root                      root
      /  \ \                    /    \
 parent   * *              parent    parent
 / / \ \                  / / \ \     / \
* * * *                  * * * *      * *
```

These sorts of issues -- considered across hashing, encoding, and decoding --
would be an implementation and testing burden that's not worth the small
performance gain. On the other hand, there's also the question of whether a
larger fanout could reduce the space requirement, which might matter for small
embedded implementations. At first glance, it's not clear that a larger fanout
helps here.
For example, at fanout 4 the maximum depth of the subtree stack is halved, but
the implementation needs to store up to 3 subtree hashes at each level instead
of 1, so the overall storage requirement is actually 50% larger. However, the
implementation might instead store the incremental state of a parent hash at
each level rather than a complete list of subtree hashes. The state words of
an incremental BLAKE2s hash are 32 bytes, and the implementation would
probably need to buffer a full 64-byte block along with those words to account
for finalization. That's no space benefit for a fanout of 4 (the 3 subtree
hashes per level described above were also 96 bytes), but it's independent of
the fanout, so larger fanouts could benefit from reduced stack depth without
increasing the storage per level any further. Conceptually, at unlimited
fanout (again like KangarooTwelve), there would be no stack at all, just the
single incremental state of the root node. Ultimately this space optimization
is too niche to justify shaping the entire Bao design around it. Without a
specific platform in mind, it's hard to say how much space savings is needed.
And applications with extremely tight storage requirements already have the
option of constraining the maximum input size, as noted above.

### Why is the encoding format malleable?

**Because the decoder doesn't read EOF from the encoding.** For example, if
the decoder reads the 8-byte length header and parses a length of 10 bytes,
its next read will be exactly 10 bytes. Since that's the final and only chunk,
decoding is finished, and the decoder won't do any more reads. Typically the
encoded file would have EOF after 18 bytes, so that another read would have
returned zero bytes anyway. However, the encoded file may also have "trailing
garbage" after byte 18. Since the decoder never looks at those bytes, they
have no effect on decoding. That means that two "different looking" encoded
files, which differ in their trailing garbage, can successfully decode to the
same output.

Note that this only concerns decoding, not hashing. There's no such thing as
trailing garbage in the context of hashing, because hashing operates on raw
input bytes rather than on the encoding format, and because hashing
necessarily reads its input to EOF. Rather, this concerns applications that
might want to compare two encoded files byte-for-byte, maybe as a shortcut to
predict whether decoding them would give the same result. That logic would be
broken in the presence of trailing garbage added by a bug or by an attacker.
The only valid way to learn anything about the contents of an encoded file is
to decode it.

Another scenario that might lead to malleability is that a decoder might not
verify all parent nodes. For example, if the decoder sees that an encoding is
8 chunks long, and it has buffer space for all 8 chunks, it might skip over
the encoded parent nodes and just reconstruct the whole tree from its chunks.
The reference decoder doesn't do this, in part because it could hide encoder
bugs that lead to incorrect parent nodes. But if an implementation is careful
not to emit any chunk bytes to the caller until all of them have been verified
against the root hash, it can skip reading parent nodes like this without
violating the security requirements. In this case, those parent nodes would be
malleable. As discussed above, a "sloppy" decoder might also ignore some
mutations in the length header, without failing decoding.
That's strictly incorrect, and it violates security requirements related to
the original input length, but the possibility that a buggy implementation
might do that is yet another reason to assume that encoded bytes are
malleable. To be clear though, none of the scenarios discussed in this section
violate the guarantee that decoded bytes match the original input.

## Other Related Work

- The [Tree Hash
  Exchange](https://adc.sourceforge.io/draft-jchapweske-thex-02.html) format
  (2003). THEX and Bao have similar tree structures, and both specify a binary
  format for encoding the tree to enable incremental decoding. THEX uses a
  breadth-first rather than depth-first encoding layout, however, which makes
  the decoder's storage requirements much larger. Also, as noted by the
  Keccak/SHA-3 team in [*Sufficient
  conditions*](https://eprint.iacr.org/2009/210.pdf), THEX doesn't
  domain-separate its root node, so it's vulnerable to length extension
  regardless of the security of the underlying hash function.
- [Tree
  Hashing](https://www.cryptolux.org/mediawiki-esc2013/images/c/ca/SL_tree_hashing_esc.pdf)
  (2013), a presentation by Stefan Lucks, discussing the requirements for
  standardizing a tree hashing mode.
- [RFC 6962](https://tools.ietf.org/html/rfc6962) uses a similar tree layout
  and growth strategy.