Crates.io | lozizol |
lib.rs | lozizol |
version | 0.5.3-dev |
source | src |
created_at | 2021-02-19 21:06:31.628859 |
updated_at | 2021-03-06 11:13:38.63745 |
description | Back to basics, efficient event-sourcing protocol |
homepage | |
repository | |
max_upload_size | |
id | 357749 |
size | 153,530 |
Extensible and efficient serialization protocol.
Aiming at both wire and storage, multiplexed streams, concatenation, event sourcing...
Event sourcing? History is immutable and only it's interpretation may change between application versions. On the down side, the application needs to know how to interpret the records of all previous versions. On the up side, there is no need for migrations or data updates on application upgrade.
sequence
- a Lozizol sequence of entries with properties:
uuid
) - a UUID of the sequence.entry
- an item of a sequence with following properties:
type
- the meaning of the serialized entry data identified by a URIThese entry types are implied and default by Lozizol:
urn:lozizol:header
) is an entry to be placed at the beginning of the
sequence to convey Lozizol version, identifier and diagnostic information.
it implies removal of all explicit entry type assignemnts.
The implied type assignment is 111 (0x6F
)urn:lozizol:type
) is a reservation of a type number and assigned to
a URI
identifying the type of entry and perhaps providing detailed information.
Empty URI indicates that a type assignment is removed from the sequence.
The implied type assignment is 1urn:lozizol:deleted
) is an entry that should be considered removed form sequence.
The fixed type assignment is 0Other entry types are defined by applications with a URI and assigned in a Lozizol sequence using the type assignment entry. Type assignment can be assigned, reassigned or removed even in the middle of a Lozizol sequence. A type must be assigned or implied before it can be used in a Lozizol sequence.
The higher level implementations must work with the type URIs, rather than type IDs. As a result, one can assign the same type under multiple type IDs and they will all be treated equally.
byte
: a sequence of 8 bitsvuint
: variable length encoded size (non-negative integer).
abcd efgh ijkl mnop
in binary vuint becomes
1000 00ab 1cde fghi 0jkl mnop
.0x80
) are invalid. Implementations must treat the sequence as corrupt on encountering this.uri
: sequence of bytes representing a valid URI string per RFC 3986record
: the binary representation of an entry
:
size
(vuint
): the serialized type and data size (really - type vuint length included)type
(vuint
): the entry typedata
: binary representation of the entry according to typepadding
: all bytes in Lozizol sequence are either of records or padding.
Since a record by definition must have at least two bytes (size and type),
a null (0x00
) byte outside of a record would become a record of zero length and no type
and is therefore considered a single byte padding. Padding can be used to align
records to a boundary or can be a result of wiping of deleted records.lozizol serialize vuint 16777215 | xxd -b
and comare that with echo -n -e '\xFF\xFF\xFF' | xxd -b
. While this introduces some redundancy (vuint 0x8011
is the same as 0x11
), it is for the benefit of humans and easier implementation. Other schemes subtract/add 1
in the process of encoding/decoding, but that means the 0
s and 1
s cannot be checked visually and it adds some computing overhead.1
s in the first byte is more efficient to compute, but would break the wiping/padding semantics. If the first byte was deleted and other's not, there would be no way to tell what the byte means. Spreading the marker bit is essential.0x6C
is also a vuint length 108,
the full header size is thus 109 (1 + 108) bytes0x6F
is the predefined entry type 111Since the Lozizol header would typically be the first record stored in a file, it allows for magic bytes based file type detection. It is deliberately human readable so that unsuspecting explorers can learn about the nature of the stored and/or transferred data.
Without the Lozizol header the implementation must either:
Try
try_lozizol () {
data="zizol 0.5 039343bb-9019-4d6e-825e-631aa578ff7f lozizol-cli 0.0.1-preview and what not ....................."
lozizol serialize entry 111 "$data" | xxd
# lozizol magic:
lozizol serialize entry 111 "$data" | dd bs=1 count=109 status=none | cut -d ' ' -f1
# lozizol version:
lozizol serialize entry 111 "$data" | dd bs=1 count=109 status=none | cut -d ' ' -f2
# sequence ID:
lozizol serialize entry 111 "$data" | dd bs=1 count=109 status=none | cut -d ' ' -f3
# diagnostic info:
lozizol serialize entry 111 "$data" | dd bs=1 count=109 status=none | cut -d ' ' -f4-
}
try_lozizol
A Lozizol sequence is initialized with the header. The header sets the sequence id
and protocol version
. It resets all type assignments in the given sequence do default (unassign all and assign only the ones implied by lozizol).
Note: The Lozizol header type id 111 can be reassigned once the sequence is initialized.
size
(vuint
): the length of type
+ typeid
+ typeuri
entry_type
(vuint
): type assignment entry type (0x01
by Lozizol default unless reassigned)assigned_type
(vuint
): new type assignment non-zero numberuri
(uri
): a URI string of the typeTry
lozizol serialize vuint 63 | xxd # find that 63 serializes as 0x3F ('?')
lozizol serialize entry 1 "?urn:my-awesome-type" | xxd
or a specific shortcut
lozizol serialize type 1 63 urn:my-awesome-type | xxd
Note: The type assignment type id 1 can be reassigned.
vuint
): 1 + the size of the deleted data0x00
): entry deleted entry typeWarning: The deletion type id 0 MUST NOT be reassigned as it would break the padding semantics.
The protocol anticipates aggressive forward only readers following a forward only writer to be the most common scenario. This fits for both file operations and network streaming. Fast seeking forward is supported by prefixing the entries with length.
Multiplexing or random write access can be accommodated too with some caution applied by the application. Lazy cleanup and indexing writes can take place without affecting integrity and reads. Indexing can be built on top of Lozizol to enable random read access.
Use case: all the data required to serialize the entry is available. In other words it will not block excessively by waiting for serialized data to manifest (IO or intense serialization).
Examples:
The writer:
Record is considered committed when all data according to size is written.
The reader simply eagerly reads all available data. When it reaches the end of the record it produces the entry.
The reader:
Parallel approach is that the reader skips through the records by their length. actual record reading is delegated to asynchronous record readers by giving them the byte range to munch on.
Use case: the application needs to mark an already written record as deleted.
The aplication should have its own semantics of labeling data structures redundant, such as adding an entry that an entity has been deleted, in the spirit of event sourcing. Still, for archiving/cleanup purposes, the application should be able mark a record as deleted. See also Wiping and Secure erasure.
The deleted entry type (0
) is intended to be applied to already writen records to
mark them as deleted. As long as the single byte write operation can be
considered atomic, the deletion can be combined with reads and writes and
is safe for the readers as the reader is either before the type ID and will see
it as deleted, or is behind the type ID and will see the original record.
The data does not change, but with the entry type being reset, it is no longer possible to interpret it's meaning reliably. Readers should skip deleted records unless they are in the business of reclaiming unused space for compression or executing data erasure.
Deletion is a single byte write operation. To mark a record as deleted, put null (0x00
) byte at the position immediately
following the record length. If this operation cannot be executed in atomic fashion, deletions must block
readers and random access writers on the given record to ensure consistency.
Lozizol sequence can be compressed as a whole, like any other stream. A fitting
optimization is to wipe all deleted records with null (0x00
) bytes beforehand.
This can be done as a background job on a stored Lozizol sequence or passing through a wiping process.
The background wiping process:
0x00
) bytes in place of type,0x00
) bytes in place of data,0x00
) bytes in place of size.This will effectively create padding. It should be safe to fail at any moment.
The background wiping process can optionally optimize subsequent reads by creating deleted entry records to span the continuous blocks of padding.
The pass through (streaming) wiping process:
0x00
) bytes.Sensitive applications may require secure erasure of stored and deleted records.
This can be accomplished similarly to the background wiping process except that
before null (0x00
) filling the data, the secure erasure process runs adequate measures
to eliminate residual data traces from the storage media.
Before even considering random access asynchronous writes, think about your model. Can these writes be expressed as smaller entries written forward only? For example, rather than writing down the whole e-mail as it comes from the wire, which would beg an asynchronous write optimization, write each header block and each body block as a record tagged with the common mail ID and close with a terminating record (a.k.a. event sourcing).
Alternatively, simple parallelism can be achieved by partitioning the Lozizol sequences into multiple streams.
Still, there may be situations where asynchronous random access writes are desirable. For instance in case of UDP transfers, or intense serialization.
Use case: I really want to write asynchronously
There should be one authority over each Lozizol sequence that allocates records to be written. It will only write down the record length and type and hand over the byte range to an asynchronous writer. Immediately after that it can skip to the next available position and allocate the next record write request.
Since the integrity of the record cannot be asserted by reaching the end of data in this case, there must be a commit mechanism built into the application - either a hash as part of the record that will invalidate dirty reads or a separate commit record. Writing the commit record must block the authority. One simple commit mechanism would be to use one type for dirty records and another for fully written records. After writing, flushing, syncing, the writer would then update the record type. This is tricky because the type vuint length must not change. IDs of the same vuint length must be chosen.
For now it is left to the application do define suitable entry types and the rules for handling asynchronous writes with integrity guarantees.