## Smaller serialization for RISC Zero
This repository implements a different algorithm for RISC Zero's serialization and has been
circulated for discussion in https://github.com/risc0/risc0/pull/1303.
There is a chance that this algorithm may become the official serializer in RISC Zero, but there are
issues pending to be resolved.
In the meantime, developers can use this serializer in their own implementation for customized serialization and
deserialization.
### Motivation
RISC Zero serializes the input to the zkVM into a vector of `u32`, through the `serde` framework.
This, however, means that it suffers from one of the major open problems in Rust.
When `serde` implements `Serialize` and `Deserialize` for Rust data structures, it has the following implementation:
```rust
impl Serialize for Vec where T: Serialize
{
#[inline]
fn serialize(&self, serializer: S) -> Result
where S: Serializer,
{
serializer.collect_seq(self)
}
}
```
And `collect_seq` has the following default implementation, meaning that the element would be serialized one after the
other, as a concatenation.
```rust
pub trait Serializer: Sized {
fn collect_seq(self, iter: I) -> Result
where I: IntoIterator, ::Item: Serialize,
{
let mut iter = iter.into_iter();
let mut serializer = tri!(self.serialize_seq(iterator_len_hint(&iter)));
tri!(iter.try_for_each(|item| serializer.serialize_element(&item)));
serializer.end()
}
}
```
This would impact RISC Zero because now, to serialize a byte array `Vec`, each number would be converted into u32, and an
array of `Vec` are to be serialized, which leads to 4x storage overhead.
Back to the serde discussion. The issue is that we may want to specify different rules for different `T`, particularly, if `T = u8`, for better
efficiency. One solution is the [serde_bytes](https://docs.rs/serde_bytes/latest/serde_bytes/index.html) crate, which
allows one to bypass the limitation through a customized serde function.
```rust
use serde::{Deserialize, Serialize};
#[derive(Deserialize, Serialize)]
struct Efficient<'a> {
#[serde(with = "serde_bytes")]
bytes: &'a [u8],
#[serde(with = "serde_bytes")]
byte_buf: Vec,
#[serde(with = "serde_bytes")]
byte_array: [u8; 314],
}
```
The idea is to implement a different implementation strategy that does not apply a generic rule to `Vec` (for any serializable `T`), and
asks the developer to switch between these two strategies. There are a few problems with this approach:
- It requires developers to modify the layers and layers of abstractions. If a developer uses four crates---A, B, C,
D---A depends on B, B depends on C, C depends on D, and D uses `Vec`, the developer needs to go all the way down to
D to change the definitions of its data structures. This can cause compatibility issues with the rest of the system.
- It is very difficult to be comprehensive because `Vec` is extremely common in Rust data structures, and a developer
would need to do a few passes of the code in order to clean this issue.
- If it requires significant patching, it would be very difficult to sync the code with their original repositories.
This not only increases maintenance overhead, but in practice often leads to security issues.
People's hope for this problem rests on [specialization](https://rust-lang.github.io/rfcs/1210-impl-specialization.html),
but there are limited chances that this can become stable any time soon. Use nightly for production environment is heavily
discouraged, and would not be suitable for RISC Zero's zkVM because it is in the process of becoming part of Rust.
## Our solution
Prior discussion shows that a bottom-up approach can be disastrous. This repository suggests a top-down approach, or in
other words, we do not require any modification to the existing data structures implemented in Rust, but instead, we
present a serializer that converts RISC Zero input into `Vec`, with the corresponding deserialization.
The idea is to have a side buffer, which we call `ByteBuffer`, managed by `ByteHandler`. When it observes several
continuous u8 being serialized, it tries to put them together rather than having each of them occupying a word.
The byte buffer consists of four bytes. So when there are four bytes in the buffer, a word would be produced, and the
buffer would be emptied. When something other than a byte is being serialized, the byte handler would immediately emit
a word and clean up the buffer.
Detail of the implementation can be found in the codebase. Below we summarize the main changes in the code.
### Serialization
The old code for `serialize_u8` and `serialize_u32` are good examples to compare the code before and after,
other than the code for the new finite-state automata.
```rust
// Old
impl<'a, W: WordWrite> serde::ser::Serializer for &'a mut Serializer {
fn serialize_u8(self, v: u8) -> Result<()> {
self.serialize_u32(v as u32)
}
fn serialize_u32(self, v: u32) -> Result<()> {
self.stream.write_words(&[v]);
Ok(())
}
}
// New
impl<'a, W: WordWrite> serde::ser::Serializer for &'a mut Serializer {
fn serialize_u8(self, v: u8) -> Result<()> {
self.byte_handler.handle(&mut self.stream, v)
}
fn serialize_u32(self, v: u32) -> Result<()> {
self.byte_handler.reset(&mut self.stream)?;
let res = self.stream.write_words(&[v]);
if res.is_err() {
return Err(Error::from(res.unwrap_err()));
} else {
return Ok(res.unwrap());
}
}
}
```
The main changes are `self.byte_handler.handle()` and `self.byte_handler.reset()`.
- `Handle` passes over a byte to the byte handler so that this byte would be emitted together with other bytes into a
word when appropriate.
- `Reset` tells the byte handler that something other than a byte is going to be serialized, and whatever in the buffer
must be emitted, and the buffer needs to be emptied.
### Deserialization
Similarly, the code change consists of redirection on `deserialize_u8` and insertions of
`activate_byte_buf_automata_and_take!(self)` and `deactivate_byte_buf_automata!(self)` macro calls.
```rust
// old
impl<'de, 'a, R: WordRead + 'de> serde::Deserializer<'de> for &'a mut Deserializer<'de, R> {
fn deserialize_u8(self, visitor: V) -> Result
where
V: Visitor<'de>,
{
visitor.visit_u32(self.try_take_word()?)
}
fn deserialize_u128(self, visitor: V) -> Result
where
V: Visitor<'de>,
{
let mut bytes = [0u8; 16];
self.reader.read_padded_bytes(&mut bytes)?;
visitor.visit_u128(u128::from_le_bytes(bytes))
}
}
// new
impl<'de, 'a, R: WordRead + 'de> serde::Deserializer<'de> for &'a mut Deserializer<'de, R> {
fn deserialize_u8(self, visitor: V) -> Result
where
V: Visitor<'de>,
{
visitor.visit_u8(self.byte_handler.handle_byte(&mut self.reader)?)
}
fn deserialize_u128(self, visitor: V) -> Result
where
V: Visitor<'de>,
{
self.byte_handler.reset()?;
let mut bytes = [0u8; 16];
self.reader.read_padded_bytes(&mut bytes)?;
visitor.visit_u128(u128::from_le_bytes(bytes))
}
}
```
### Handling one corner case
The high-level plan described above has a limitation. Since the byte handler is withholding bytes, and that the serializer
would not notify the byte handler when it reaches the end of serialization, there is a situation when the byte handler
is unable to emit the bytes into a word because the serialization has been completed.
This is a tricky issue that requires special attention. Our strategy is to observe that, first of all, we can split all types in Rust into three groups.
- primitive types:
* bool, u8, u16, u32, u64, u128, i8, i16, i32, i64, i128, f32, f64, char, str,
* ()
* unit struct aka "struct Nothing"
* unit enum aka "enum{ A, B, C }" with no underlying variables
- "newtype", a virtually pseudo type defined over another type, which is specifically defined for efficiency
* Option, which can be considered as `enum { None, Some(T) }`
* newtype struct, aka `struct(T)`
* newtype variant, aka `enum { struct1(T1), struct2(T2), ... }`
- wrapper:
* seq, including vec![], long slice
* tuple aka `(T1, T2)`
* tuple struct aka `struct(T1, T2)`
* tuple variant aka `enum { struct1(T1, T2), struct2(T3, T4, T5) }`
* map
* struct aka `struct { a: A, b: B}`
* struct variant aka `enum { struct1 {a: T1, b:T2}, struct2{a: T3, b: T4, c: T5) }`
Note that if the buffered bytes have not been fully written to the stream, it means that the last primitive type being
written has to be either bool or u8 (we will just treat bool as u8 in the following). There are no primitive types after it,
otherwise it would be written to the stream because the byte handler would be reset.
So, there are only two possibilities of this u8.
- the variable to be serialized is directly or indirectly inside some sort of a wrapper. By "indirectly", it means that `struct { Option)> }` will also be considered as "inside a wrapper".
- the variable to be serialized is not inside any wrapper. It could be `u8` or `Option)>`.
We handle both separately.
For the first case, we introduce a notion of "depth". When serializer enters a wrapper, the depth is increased by one, when it leaves the wrapper, the depth is decreased by one. When the depth is zero, it means that it must have left the last layer of meaningful wrapper, and there are no new bytes possible after it. It needs to be immediately written to the stream when the depth hits zero. The implementation looks like this:
```rust
#[inline]
fn decrease_depth(&mut self, stream: &mut W) -> Result<()> {
self.depth -= 1;
if self.depth == 0 && self.status != 0 {
stream.write_words(&[self.byte_holder])?;
self.status = 0;
}
Ok(())
}
```
For the second case, we notice that the last u8 is in depth 0, and therefore, this u8 is put into the buffer and immediately being written down into the stream, i.e., treating u8 as u32.
```rust
fn handle(&mut self, stream: &mut W, v: u8) -> Result<()> {
if self.depth == 0 {
stream.write_words(&[v as u32])?;
} else {
......
}
Ok(())
}
```
### Result
The integration test [here](src/integration_test.rs) can provide more information. But in our example, we use a struct
that has a member `Vec` and another member `Vec>`.
```rust
fn test() {
// ...
test_s.strings = b"Here is a string.".to_vec();
test_s.stringv = vec![b"string a".to_vec(), b"34720471290497230".to_vec()];
// ...
}
```
Our experiment shows that it can correctly serialize them into the compact format in `Vec`.
### Updates
This algorithm has been completed changed several times to handle different corner cases.
It is necessary to credit @austinabell, @flaub, and @nategraf for the thought process of leading to this new algorithm.
Readers interested in the algorithm can check the pending PR: https://github.com/risc0/risc0/pull/1303.
### License
The code largely comes from RISC Zero's implementation [here](https://github.com/risc0/risc0/tree/main/risc0/zkvm/src/serde),
with modifications necessary to add the finite state automata.
Since RISC Zero is under the Apache 2.0 license, this repository would also be Apache 2.0.