Crates.io | neodyn_xc |
lib.rs | neodyn_xc |
version | 0.4.0 |
source | src |
created_at | 2021-03-29 08:04:27.428732 |
updated_at | 2022-07-05 06:01:41.889609 |
description | Neodyn Exchange is the serialization format for the Neodyn database engine |
homepage | https://h2co3.github.io/neodyn/ |
repository | https://github.com/H2CO3/neodyn.git |
max_upload_size | |
id | 375038 |
size | 434,944 |
Neodyn Exchange is a serialization format originally derived from the Serde data model. It is designed to be an easy-to-use, portable, and efficient means of data exchange between systems. Although the reference implementation is in Rust, implementations in other languages are expected and encouraged.
Like many Serde-compatible serialization formats, Neodyn Exchange is used primarily by calling top-level functions for different formats. The crate root re-exports the following functions:
// For serialization:
fn to_value() // serializes to an in-memory, loosely-typed `Value`
fn to_string() // serializes to a string, in the text representation
fn to_bytes() // serializes to a Vec<u8>, in the binary format
fn to_writer() // serializes to an io::Write, in the binary format
fn to_writer_buffered() // same as to_writer(), but uses buffering
// For deserialization
fn from_value() // deserializes from an in-memory, loosely-typed Value
fn from_value_ref() // same as above, but takes a borrowed &Value instead
fn from_str() // deserializes from the text representation
fn from_bytes() // deserializes from a byte slice in the binary format
fn from_reader() // deserializes from an io::Read in the binary format
fn from_reader_buffered() // same as above, but uses buffering for efficiency
There are more functions in the conventionally-named ser
and de
modules,
which, however, are less common (e.g. writing the text format as raw bytes).
The Value
enum is also exported at the top level; it represents all
possible values the Neodyn Exchange format can work with. Additionally,
Value
and &Value
implement the Deserializer
trait, as well as
Serialize
and Deserialize
, so you can use them as a last-resort escape
hatch if some types or other formats aren't flexible enough in their
implementation of (de)serialization.
Furthermore, the types implementing Serializer
and Deserializer
are also
public (however, they are not re-exported). However, some caveats apply to
them (e.g. some of them must be .finalize()
d manually), so please refer to
their respective documentation for more information.
The Neodyn Exchange data model has three equivalent representations:
Value
type in the Rust reference implementation;Each of these representations are capable of storing the same set of values. How exactly one interprets a serialized value may differ between applications. For example, due to the varying nature of the means of parsing each representation, the Serde reference implementation may produce different kinds of errors at different points in time when deserializing the structured, the human-readable, and the binary format, respectively.
Neodyn Exchange is a self-describing format, and as such, every serialized value carries its type. This also means that as long as one is satisfied with a loosely-typed value tree as the output of the parsers, no schema is needed for decoding a serialized value.
The possible types are the following:
null
, denoting the absence of an optional value.opt
, denoting the presence of a (wrapped) optional value.bool
, just a regular Boolean, either true
or false
.int
, 64-bit signed integer.uint
, 64-bit unsigned integer.float
, 64-bit IEEE-754 floating-point number, which may not be NaN
.string
, UTF-8 encoded, human readable text.blob
, an arbitrary, raw sequence of bytes.array
, an ordered collection of arbitrary nested values.map
, an unordered collection of key-value pairs, where both the keys
and the values are arbitrary nested Neodyn Exchange values.The abstract value tree representation pretty much reflects this system one-to-one.
Remark: The signed integer, unsigned integer, and floating-point types are
collectively called numbers. (Note in particular that bool
is not
considered a number.)
The text representation is defined to be a UTF-8 encoded string. The following is an informal (read: not rigorous), BNF-style grammar for this format.
Whitespace (as defined by Unicode) before the first token, after the last token, and between consecutive tokens is allowed and ignored. The exception is strings, which are treated as a single token, and whitespace between characters in a string must not be ignored. (This is just the usual semantics, because of course we want to preserve whitespace within a string.)
Details of the notation below:
VALUE
:=
, e.g. BOOL := 'true' | 'false'
'null'
|
, e.g. foo | bar
(VALUE)
?
, *
and +
denote optional (0 or 1 repetitions),
any (0 or more repetitions), and many (1 or more repetitions) items,
respectively, e.g. (KEY ':' VALUE ',')*
[0-9]
[0-9a-fA-F]{2}
<XID_continue>
VALUE := NULL | OPT | BOOL | INT | UINT | FLOAT | STRING | BLOB | ARRAY | MAP
NULL := 'null' WB
OPT := '?' VALUE
BOOL := ('true' | 'false') WB
INT := ('+' | '-') [0-9]+ WB
UINT := [0-9]+ WB
FLOAT := ('+' | '-')? (([0-9]* '.' [0-9]+) | ([0-9]+ '.' [0-9]*) | 'inf') WB
STRING := '"' ( UNESCAPED | ESCAPED )* '"'
UNESCAPED := <any Unicode codepoint except '"' and '\'>
ESCAPED := '\n' | '\r' | '\t' | '\\' | '\'' | '\"' | '\u{[0-9a-fA-F]+}'
BLOB := '#' ([0-9a-fA-F]{2})* '#'
ARRAY := '[' ( VALUE ( ',' VALUE )* ','? )? ']'
MAP := '{' ( PAIR ( ',' PAIR )* ','? )? '}'
PAIR := VALUE ':' VALUE
WB := <Unicode word boundary or ASCII punctuation>
Notes:
The human-readable format is case sensitive. All built-in literal words are written in lowercase. (The case of strings is of course preserved.)
All numbers are written in decimal. The human-readable format does not allow bases other than 10.
Numbers and words (e.g. literal null
, true
, false
, and inf
) must
be separated from their surroundings by Unicode word boundaries or ASCII
punctuation. The purpose of this constraint is to disallow strings that
would be hard to parse for a human reader, e.g. 123null
shouldn't be
accepted as two valid tokens 123
and null
, because they are not
visually separate.
There is a distinction between signed and unsigned integers, i.e. +42
is
not the same as 42
. This means that the textual format can preserve the
type of a positive signed integer, which is distinct from an unsigned integer.
Floating-point numbers must always have digits on at least one side of the
decimal point. Canonically, they are printed with at least one digit on
both sides on the decimal point, even if that digit happens to be a zero,
e.g. -0.3
or +12.0
. The sign of a floating-point number is optional,
but it is always printed in the canonical representation.
Floating-point numbers are capable of representing positive and negative
infinity, and they preserve the sign of zero, but NaN
is disallowed.
NaN
values are converted to null
in all three representations of
the Neodyn Exchange format. This means, however, that NaN
does not
round-trip if serialized then deserialized, because attempting to
deserialize null
into a float is treated as an error.
Leading zeroes in all three numeric types are allowed and ignored,
except for floating-point inf
. Trailing zeroes in the fractional part
of floating-point numbers are also allowed and ignored.
The canonical representation is not to print any leading or trailing zeroes, except when the number (for integers) or its integer or fractional part (for floating-point numbers) is equal to zero. In these cases, exactly one zero digit is printed in each place where it is needed.
Trailing commas in arrays and maps are allowed and ignored. The canonical format is to always print trailing commas.
Whitespace is allowed between pairs of hex digits in a blob literal, but hex digits within a pair (i.e. those denoting the high and low nibble of the same byte) must not be separated by whitespace.
Hex digits in a blob literal and in Unicode escape sequences are case
insensitive and the lowercase and uppercase characters a...f
and A...F
are pairwise equivalent. The canonical representation is lowercase.
String literals are allowed to contain "raw" newlines and tabs. These are interpreted as-is. The canonical representation is to always escape newlines and tabs.
Escape sequences are interpreted according to Rust's rules. This means that
mostly the usual interpretation applies (e.g. \n
denotes a newline, \t
is a tab, etc.), with the addition of \u{...}
, where ...
denotes a
sequence of consecutive hex digits (case insensitive) that is parsed
as a Unicode code point. Leading zeroes are allowed, and the canonical
representation is not to print any leading zeroes.
Example for the text representation:
[
{
+39: -.354,
-1.: true,
+3.142: -6.283,
0: null,
1: ?"an optional string",
2: ??"two levels of optionals; even an optional null is allowed, e.g.:",
null: ?null,
"as you can see": "null is allowed to be a key as well",
"escaped\nnewline": "unescaped
newline",
["arrays","and","maps"]:{"can":"be","keys":"too"},
"this is a map": "with a trailing comma",
},
{
"optional array": ?[
"first",
"second",
],
"empty map": {},
"array without a trailing comma": [1, 2, 3],
"this is a map": "also without a trailing comma"
},
]
The binary format tries to pack values into as little space as possible. It is, however, a strictly byte-oriented format, i.e. it operates on units of at least 8 bits and assumes 8-bit bytes. It employs techniques such as variable-width integer encoding and string interning in order to achieve small serialized object sizes. Consequently, although the conceptual type system describes 64-bit integers and floating-point numbers, values of sufficiently low magnitude and/or sufficiently few significant digits may actually use less space when encoded.
Every value starts with a one-byte tag containing at least type information, but often other data such as length or an inline small value.
The types are organized hierarchically. There are three levels of types:
The type annotations and these additional pieces of information occupy well-defined places within a byte:
To sum up, the possible formats for type/value tags are:
XXX YYY ZZ
, (X = major bits, Y = minor bits,
Z = value tag bits)XXX YYY NN
(X = major bits, Y = minor bits,
N = bits of base-2 logarithm of the length of the data following)XXX VVV VV
(X = major bits, V = payload data bits)A serialized Neodyn Exchange value in the binary format consists of two parts:
Strings and binary blobs are stored in a compressed manner: only one copy of each identical byte array is ever written. These uniqued byte arrays are arranged into a "symbol table" at the beginning of the serialized data. If there are no interned strings or blobs in a value, this header may be omitted (and it is omitted by the reference implementation).
Note that if a string and a blob have the exact same data payload, they are uniqued to the same slot in the symbol table. Also note that empty strings and blobs are never interned and they are denoted by special inline "empty string" and "empty blob" markers, respectively, in the value body.
Furthermore, Each symbol table entry declares whether the corresponding data payload can (and will) be used as a string too, or only as a blob. This helps the deserializer avoid trying to parse arbitrary binary data as UTF-8.
If a symbol table entry is referenced more than once, the number of references, the so-called "use count" is also stored along with the payload. This makes it possible for the deserializer to hand over the owned buffer to the consumer upon the last reference to the symbol, sparing a copy and an allocation. (This optimization only matters when the deserializer works with owned buffers.)
After the symbol table, all other data are stored in the value body. It contains atomic values (e.g. booleans, numbers, etc.) as well as arrays and maps, inline. Strings and blobs are referenced by their index into the symbol table.
The only case when more than one value is present in the body is when there is a complex type (an array or a map), recursively containing other values. The values in an array follow each other sequentially. The keys and values of a map are interleaved: the first entry is the first key followed by the first value, then the second key and the second value, and so on.
Numerically equal type tags can have different interpretations depending on whether they are found in the header / symbol table or in the actual value body.
The following lists all of the possible major and minor types and value tags.
As previously, NN
denotes log-length bits, and VVV VV
stands for inline
small integer bits.
First, type tags shared between the header and the body:
Bits | Meaning |
---|---|
000 000 NN |
Symbol Table Start |
Second, type tags for the header / symbol table:
Bits | Meaning |
---|---|
010 VVV VV |
Short Blob, Single-use |
011 VVV VV |
Short Blob, Multiple uses |
100 VVV VV |
Short String, Single-use |
101 VVV VV |
Short String, Multiple uses |
111 010 NN |
Long Blob, Single-use |
111 011 NN |
Long Blob, Multiple uses |
111 100 NN |
Long String, Single-use |
111 101 NN |
Long String, Multiple uses |
Third, type tags for the value body:
Bits | Meaning |
---|---|
000 001 00 |
null |
000 001 01 |
Optional Present |
000 001 10 |
false |
000 001 11 |
true |
000 010 00 |
Empty string (inline) |
000 010 01 |
Empty blob (inline) |
001 VVV VV |
Small signed integer |
010 VVV VV |
Small unsigned integer |
011 VVV VV |
Low-index string |
100 VVV VV |
Low-index blob |
101 VVV VV |
Small array |
110 VVV VV |
Small map |
111 001 NN |
Signed integer |
111 010 NN |
Unsigned integer |
111 011 NN |
String |
111 100 NN |
Blob |
111 101 NN |
Array |
111 110 NN |
Map |
111 111 NN |
Floating-point number |
First some general notes about the variable-width number encoding:
If the major type of the type tag denotes a "small" value, then the 5 least significant bits (#0-#4) contain the integer value inline, either as an unsigned or as a 2's complement signed bit sequence.
Therefore, the range of inline small integers is:
If the major type of the type tag denotes a "big" value, then the
minor type tag determines the actual (sub)type of the value, and the
2 least significant bits use the aforementioned log-length encoding
to refer the size of the following integer or floating-point number.
Thus, the lower two bits might be either of 00
, 01
, 10
, 11
,
which means a subsequent number size of 1, 2, 4, or 8 bytes,
respectively. 2, 4, and 8-byte numbers are stored as little-endian.
Thus, an encoded number may occupy a total of either 1, 2, 3, 5, or 9 bytes, depending on its magnitude.
Next, the format of the symbol table is elaborated.
If the symbol table is present, the value must start with a marker byte with
the major type "simple" and the minor type "symtab", i.e. 000 000 NN
.
If the value does not start with such a byte, then no symbol table is read.
The last two bits, NN
, define the base-2 logarithm of the number of bytes
needed for encoding the number of entries in the symbol table. I.e.,
Symtab start byte | Width of integer encoding symbol count |
---|---|
000 000 00 |
1 |
000 000 01 |
2 |
000 000 10 |
4 |
000 000 11 |
8 |
Note that this means that the symbol table count doesn't have a compact, inline (1-byte) format for storing the count; the total size is always 2, 3, 5, or 9 bytes, as the "start byte" is present unconditionally.
So, the symbol count follows, as a little-endian unsigned integer. For example, a symbol table with 512 entries might begin like this:
+---------------+-------------+-------------+
| 000 000 01 | 0000 0000 | 0000 0010 |
+---------------+-------------+-------------+
| Symtab start, | count, | count, |
| 2^1 = 2 bytes | low byte | high byte |
| of length | | |
| follow | | |
+---------------+-------------+-------------+
Then, each entry follows in sequence, without any additional padding in between.
Symbol table entries consist of, in this order:
The first 1...9 bytes thus contain information about the following:
string
or blob
The bit pattern for each of the type tags can be found in the table of all type tags in the previous section.
So, for example, a 64-byte-long string that is only referred to once in the value body (and thus has an implicit use count) might be encoded as follows:
+--------------+-----------+------------------------+
| 111 100 00 | 0100 0000 | 64 bytes of UTF-8 data |
+--------------+-----------+------------------------+
| long string, | length, | actual string payload |
| single-use, | 64 bytes | |
| 2^0 = 1 byte | | |
| of length | | |
| follow | | |
+--------------+-----------+------------------------+
Meanwhile a 17-byte-long blob that is used 130 times in the body can be encoded in the following manner:
+--------------+------------+-----------+---------------------+
| 011 100 01 | 111 010 00 | 1000 0010 | 17 arbitrary bytes |
+--------------+------------+-----------+---------------------+
| short blob, | use count: | 130 | actual blob payload |
| multi-use, | unsigned | | |
| 17 bytes | integer, | | |
| long | 2^0 = 1 | | |
| | byte wide | | |
+--------------+------------+-----------+---------------------+
Non-interned data, i.e. values of every type other than strings and blobs, are stored inline in the body, which directly follows the header.
The exact format of each type of value is reproduced below.
Type Tag | Value | Following additional data |
---|---|---|
000 001 00 |
null |
None (tag-only) |
000 001 01 |
Optional | Wrapped value |
000 001 10 |
false |
None (tag-only) |
000 001 11 |
true |
None (tag-only) |
000 010 00 |
Empty String | None (tag-only) |
000 010 01 |
Empty Blob | None (tag-only) |
001 VVV VV |
Small Int | None (5-bit value inline) |
010 VVV VV |
Small Uint | None (5-bit value inline) |
011 NNN NN |
Small String | None (5-bit symtab index inline) |
100 NNN NN |
Small Blob | None (5-bit symtab index inline) |
101 NNN NN |
Small Array | Array items (5-bit count inline) |
110 NNN NN |
Small Map | Key-value pairs (5-bit count inline) |
111 001 NN |
Big Int | 2NN bytes of 2's complement |
111 010 NN |
Big Uint | 2NN bytes of unsigned int |
111 011 NN |
Big String | 2NN bytes of symtab index |
111 100 NN |
Big Blob | 2NN bytes of symtab index |
111 101 NN |
Big Array | 2NN bytes of count, then items |
111 110 NN |
Big Map | 2NN bytes of count, then pairs |
111 111 NN |
Floating-point | 2NN (4 or 8) bytes of IEEE-754 |
Consider the MsgPack front page example in the human-readable format of Neodyn Exchange:
{"compact": true, "schema": 0}
In the binary format this is represented as:
Address | Byte | Explanation |
---|---|---|
00 |
000 000 00 |
Symtab start, 200 = 1 byte of length |
01 |
000 000 10 |
2 symbols in symtab |
02 |
100 001 11 |
Short string, single-use, 7 bytes |
03 |
011 000 11 |
'c' |
04 |
011 011 11 |
'o' |
05 |
011 011 01 |
'm' |
06 |
011 100 00 |
'p' |
07 |
011 000 01 |
'a' |
08 |
011 000 11 |
'c' |
09 |
011 101 00 |
't' |
10 |
100 001 10 |
Short string, single-use, 6 bytes |
11 |
011 100 11 |
's' |
12 |
011 000 11 |
'c' |
13 |
011 010 00 |
'h' |
14 |
011 001 01 |
'e' |
15 |
011 011 01 |
'm' |
16 |
011 000 01 |
'a' |
17 |
110 000 10 |
2-element map |
18 |
011 000 00 |
String #0 |
19 |
000 001 11 |
true |
20 |
011 000 01 |
String #1 |
21 |
010 000 00 |
Unsigned integer 0 |