# BLAKE3-AEAD
*an **experimental** authenticated cipher built entirely on top of the BLAKE3
hash function*
## Primary goals
- Drop-in replacement for AES-GCM in TLS-like use cases
- High performance for short messages (~1 KiB)
- Defined in terms of the standard BLAKE3 API
## Nice-to-have features
- Nonces can be up to 64 bytes.
- Relatively large maximum message and AAD lengths, 262 and
262-1 bytes respectively
- The message and associated data can be processed in in parallel, in one pass
each, without knowing their lengths in advance.
- Truncating the auth tag to N bits retains the expected O(2N) bits
of security. (Correct?)
- A compact implementation can work directly with the BLAKE3 compression
function and omit the tree hashing parts.
## Sharp edges
- Unique nonces are the caller's responsibility.
- Nonce reuse is catastrophic for security.
- Decryption has to either buffer the entire message or handle unauthenticated
plaintext.
- No key commitment
Aside: All of these drawbacks are in common with AES-GCM and ChaCha20-Poly1305.
These TLS-oriented ciphers prioritize bytes on the wire and short-message
performance above all else, and in my opinion that makes them "hazmat" building
blocks, for experts only. BLAKE3-AEAD aims for the same use case and makes the
same tradeoffs. For a more general-purpose design with fewer sharp edges, see
[Bessie](https://github.com/oconnor663/bessie).
## Design
### Some BLAKE3 background
This design relies on two features of BLAKE3 in particular. First, BLAKE3 has a
built-in keyed mode. It works by substituting the caller's key in place of the
standard IV, and it doesn't require any extra compressions.
Second, BLAKE3 supports extendable output. Extended output blocks are produced
by incrementing the compression function's internal `t` parameter, which allows
for computing blocks in parallel or seeking to any point in the output stream.
This makes the extended output a natural stream cipher, and it also lets us use
seeking for domain separation, similar to the tweak in a tweakable block
cipher.
### Universal hash
The Python code samples in this document are excerpted from
[`blake3_aead.py`](python/blake3_aead.py).
```python
TAG_LEN = 16
BLOCK_LEN = 64
def universal_hash(key, message, initial_seek):
output = bytes(TAG_LEN)
for block_start in range(0, len(message), BLOCK_LEN):
block = message[block_start : block_start + BLOCK_LEN]
seek = initial_seek + block_start
block_output = blake3(block, key=key).digest(length=TAG_LEN, seek=seek)
output = xor(output, block_output)
return output
```
In other words:
- Split the message into 64-byte blocks, with the last block possibly short.
- Compute the keyed BLAKE3 hash of each block separately.
- For each block output, seek to the position in the output stream equal to
`initial_seek` plus the message position. Take 16 bytes.
- XOR all the 16-byte tags together to form the output.
The XOR structure makes it possible to compute all of the blocks in parallel.
The regular BLAKE3 tree structure is also parallelizable, but only at a
granularity of 1 KiB chunks, while `universal_hash` is parallelizable over
64-byte blocks. This is important for short-message performance.
The security properties of this function are intended to be similar to those of
GHASH from AES-GCM or (unmasked) Poly1305 from ChaCha20-Poly1305. ["Universal
hash"](https://en.wikipedia.org/wiki/Universal_hashing) is a somewhat academic
term, but suffice it to say that these are _much weaker_ than a regular keyed
hash like BLAKE3 or HMAC. They're as hazmat as hazmat gets.
The `initial_seek` parameter is used for domain separation below.
### Encrypt
```python
MSG_SEEK = 2**63
AAD_SEEK = 2**63 + 2**62
def encrypt(key, nonce, aad, plaintext):
stream = blake3(nonce, key=key).digest(length=len(plaintext) + TAG_LEN)
ciphertext_msg = xor(plaintext, stream[: len(plaintext)])
msg_tag = universal_hash(key, ciphertext_msg, MSG_SEEK)
aad_tag = universal_hash(key, aad, AAD_SEEK)
tag = xor(stream[len(plaintext) :], xor(msg_tag, aad_tag))
return ciphertext_msg + tag
```
The BLAKE3 XOF supports up to 264-1 output bytes. The output space
is divided into three parts, with the key stream starting at offset 0, the
message authenticator starting at offset 263, and the associated
data authenticator starting at offset 263+262. The stream
cipher produces 16 extra bytes of output beyond the message length, and those
extra bytes are used to mask the combined authentication tag.
### Decrypt
```python
def decrypt(key, nonce, aad, ciphertext):
plaintext_len = len(ciphertext) - TAG_LEN
ciphertext_msg = ciphertext[:plaintext_len]
stream = blake3(nonce, key=key).digest(length=len(ciphertext))
msg_tag = universal_hash(key, ciphertext_msg, MSG_SEEK)
aad_tag = universal_hash(key, aad, AAD_SEEK)
expected_tag = xor(stream[plaintext_len:], xor(msg_tag, aad_tag))
if not compare_digest(expected_tag, ciphertext[plaintext_len:]):
raise ValueError("invalid ciphertext")
return xor(ciphertext_msg, stream[:plaintext_len])
```
## Rationales
### Authenticator structure
Broadly speaking there are two options for constructing the authenticator:
- Compute it with the long-term key and mask it with the key stream (what this
design does).
- Compute it with a nonce-derived key and publish it unmasked.
Deriving a subkey would probably always incur a call to the compression
function, while masking is sometimes free, if the final block of the message
happens to be 48 bytes or less. Note that if you use a tag mask, you don't need
it until the end of the encryption process, so it's natural to get it from the
tail of the stream. But if you use derived subkeys, you need them at the start,
and using the tail of the stream would be awkward.
"Add 16 bytes to the end of the stream" is also easier for the implementation
to parallelize than "reserve a block at the front of the stream". Ideally you
get the whole stream from one function call, but then that call needs output
space to write the stream bytes. The most natural place is the caller's output
buffer, especially if you can XOR the stream directly over the plaintext. That
buffer already needs 16 bytes at the end reserved for the tag, so using those
bytes as scratch space is free. But asking the caller for a block of scratch
space at the front [would be awkward](https://nacl.cr.yp.to/secretbox.html).
Note that it's not safe to use `universal_hash` unmasked as-is, because its
output is all-zero when its input is empty, regardless of the key. We could
change that by adjusting the definition so that the empty input is still
considered one block. On the other hand, it's nice not to incur a call to the
compression function for the AAD when it's empty.
### Misuse resistance
It could've been nice to incorporate the MAC of the plaintext into the key
stream, to provide some resistance against nonce reuse. But that has
performance downsides: Encryption would require two passes over the input, and
the recipient would have to do all the work of decryption before rejecting a
bad packet.
### Seek constants
We could've divided the XOF output space into three approximately equal parts,
rather than the current arrangement where half of it is allocated to the key
stream. However, increasing the maximum message size from 262 bytes
to ~262.4 bytes has almost no practical value, and it's nicer to
keep the `MSG_SEEK` and `AAD_SEEK` constants simple.
### Nonce length
Supporting nonces larger than 64 bytes would be trivial for any implementation
that's built with a BLAKE3 library, and in fact the code samples above already
do (because some asserts have been omitted). However, not all implementations
will want to carry along the full BLAKE3 hash function. A compact
implementation might prefer to work directly with the compression function and
omit the tree hashing parts. Restricting nonces to 64 bytes allows for these
compact implementations, and 64 bytes is already quite generous. For
comparison, the extended nonces in XSalsa and XChaCha are only 24 bytes.