Crates.io | rustbif |
lib.rs | rustbif |
version | 0.1.1 |
source | src |
created_at | 2024-08-01 07:01:49.34436 |
updated_at | 2024-08-03 02:27:56.363213 |
description | Rust Binary Format |
homepage | |
repository | https://github.com/ikhomyakov/cerdito |
max_upload_size | |
id | 1321687 |
size | 70,613 |
rustbif
- The Rust Binary FormatThis binary serialization format is designed to encode a subset of Rust's algebraic data types, focusing on maintaining a high degree of efficiency, compactness, and backward and forward compatibility. The data structures can evolve according to these two rules:
Fix enum tags: limit to positive numbers up to u32, allow const expressions in discriminators. Negative disriminators? Consider removing support for explicit variant discriminators.
Errors (remove unwraps and panics), figure out if need to insert ?;Ok(...)
or return directly.
Decl macro for tuples.
[T; N] doesn't implement Default and user can't implement it either. Default is required when "new" program reads "old" data. Need to address it.
Implement Arr<bool, N>, Arr<char, N>, Arr<u8, N>, ..., Arr<u128, N>, Arr<i8, N>, ..., Arr<i128, N>, Arr<f32, N>, Arr<f64, N>, VArr
Consider using Result<T, E> in your datastructures -- this is a transparent (or not?) wrapper that wont appear on the wire and that will only capture success/failure of sub-structure decoding. How to encode data that has Err()s?
Consider using attribute (default) to control compat behaviour at the runtime: when a field is missing, return Ok(default()) OR Error(); Or try to figure out if the type impl Default or not. Explicit attribute could be better, for example, the attr can also provide value for the default.
Each data type is encoded as an element, which can be one of the following: varint
—a data of up to 128 bits, varbytes
—a variable-size byte sequence, varstruct
—a variable-length sequence of elements (e.g. struct, tuple, list), or varenum
- a tagged element.
varint: Encodes up to 128 bits of data.
0b1110LLLL
, where LLLL
+ 1 (1..=16) indicates the number of following bytes that encode the number. Note that this is a variable size encoding, for example, u32
integer depending on its value can be encoded using 1, 2, 3 or 4 bytes.varenum: Encodes a tagged element (supports tags up to 32 bits)
0b111111MM
, where MM
+ 1 (1..=4) indicates the number of following bytes that encode the tag.varbytes: Encodes a variable-sized byte sequence of up to 2^64 bytes.
varint
0 (the wire format doesnt make difference if this a number 0 or en empty sequence).0b10LLLLLL
followed by the bytes, where LLLLLL
+ 1 is the length (1..=64) of the byte sequence.0b11110MMM
, followed by up to MMM
+ 1 (1..=8) varint bytes encoding the length of the sequence, then the sequence itself.varstruct: Encodes a variable length sequence of elements of up to 2^32 elements.
varint
0 (the wire format doesnt make difference if this a number 0 or en empty sequence).0b110LLLLL
followed by LLLLL
+ 1 (1..=32) elements.0b111110MM
, followed by up to MM
+ 1 (1..=4) varint bytes encoding the length of the sequence, then the elements.Primitive types:
u8
, u16
, u32
, u64
, u128
: Serialized to LE bytes and then encoded as varint
i8
, i16
, i32
, i64
, i128
: Converted to unsigned int using zigzag encoding (see below) and then encoded as unsigned ints.f32
, f64
: Serialized to BE bytes and then encoded using varint
.bool
: Converted to u8
(false: 0, true: 1) and then encoded as u8
.char
: Converted to u32
and then encoded as u32
.Opaque arrays
String
: Encoded as varbyte
.ByteVec
(same as VArr<u8>
): Encoded as varbyte
.Arr<bool, N>
, Arr<char, N>
, Arr<u8, N>
, ..., Arr<u128, N>
, Arr<i8, N>
, ..., Arr<i128, N>
, Arr<f32, N>
, Arr<f64, N>
, VArr<bool>
, VArr<char>
, VArr<u8>
, ..., VArr<u128>
, VArr<i8>
, ..., VArr<i128>
, VArr<f32>
, VArr<f64>
: Encoded as varbyte
, array data is encoded using fixed-size LE order of corresponding primitive data types.Generic arrays:
[T; N]
: Encoded as varstruct
.Vec<T>
: Encoded as varstruct
.Tuples, structs and enums:
()
, (T1,)
, (T1, T2, ..., T32)
: Encoded as varstruct
.Struct(T1, ..., TN)
: Encoded as (T1, ..., TN)
.Struct{ f1: T1, ..., fN: TN }
: Encoded as (T1, ..., TN)
.Enum{ V1 = d1, ..., Vi(T1, ..., TM) = di, ..., VN = dN }
: Variant i
is encoded as varint
if it is a unit variant. Otherwise, it is encoded as varenum
, i.e. enum tag followed by variant struct.If we encode the following data structure:
(
SampleEnum::B {
a: 'A',
b: SampleStruct {
a: String::from("hello, world!"),
b: 15,
},
},
(),
)
Where the struct and thge enum are defined as follows:
#[derive(Debug, Default, Encode, Decode)]
struct SampleStruct {
a: String,
b: i32,
}
#[repr(u8)]
#[derive(Debug, Default, Encode, Decode)]
enum SampleEnum {
#[default]
None,
A(String) = 10,
B {
a: char,
b: SampleStruct,
} = 20,
}
We will produce the following binary code:
c1 74 c1 41 c1 8c 68 65 6c 6c 6f 2c 20 77 6f 72 6c 64 21 1e 00
Where:
0xc1 0b11000001 - varstruct tuple, its 2 elements follow
0x74 0b01110100 - varenum with tag 20, its 1 element follow
0xc1 0b11000001 - varstruct struct "SampleEnum::B", its 2 fields follow
0x41 - field a: varint encoded char 'A'
0xc1 0b11000001 - field b: varstruct struct "SampleStruct", its 2 fields follow
0x8c 0b10001100 - field a: varbyte string, its 13 bytes follow
0x68, 0x65, 0x6c, 0x6c, 0x6f, 0x2c, 0x20, 0x77, 0x6f, 0x72, 0x6c, 0x64, 0x21 - "hello, world!"
0x1e - zigzag encoded integer 15
0x00 - empty varstruct encoded as 0
Zigzag encoding takes a signed integer and encodes it as an unsigned integer. It does so by counting up, starting at zero, alternating between representing a positive number and a negative number.
To encode any signed integer, x
, with a representation length—in bits,
n
, the formula is as follows:
(x >> n - 1) ^ (x << 1)
Note: The ^
operator is bitwise XOR.
To convert a zigzag-encoded unsigned integer, x
, to its decoded signed
counterpart, the formula is as follows:
(x >>> 1) ^ -(x & 1)
Note: The >>>
operator represents a logical right shift as opposed
to an arithmetic right shift. In Rust, unsigned integer types implement the
right shift operator as logical instead of arithmetic. Therefore, the
formula in Rust is simplified as (x >> 1) ^ -(x & 1)
.
This encoding format is not fully self-describing, meaning the original data structure cannot be reconstructed solely from the binary format. However, it includes enough information to support automatic backward and forward compatibility under certain constraints:
When a new program reads old data, it assigns default values to any new struct fields. Conversely, an old program reading new data will skip unknown struct fields and signal an error if it encounters an unsupported new enum variant. This ensures a high degree of robustness as data structures evolve over time and eliminates the need for tedious management of data structure versioning.