Crates.io | udv |
lib.rs | udv |
version | 0.3.1 |
source | src |
created_at | 2022-02-24 21:33:27.075613 |
updated_at | 2022-02-27 05:03:52.699735 |
description | Unambiguous Delimited Values. A smarter successor to CSV. |
repository | https://gitlab.com/Taywee/udv |
id | 538774 |
size | 20,580 |
Unambiguous Delimited Values. Similar to CSV, but consistent, unambiguous, and predictable.
UDV uses leading delimiters and simple character escapes. This makes the start of every unit and record unambiguous, gives headers an explicit declaration, lets documents be concatenated safely, distinguishes a record with zero fields from a record with one blank field, and allows arbitrary binary data.
The format is encoding-agnostic, but each delimiter must be a single codepoint. The obvious canonical representations are UTF-8 and binary, but any encoding is possible.
The grammar in EBNF, where each all-caps name is a configurable single-codepoint delimiter and "*" stands for any single codepoint:
stream = {garbage}, { message, {garbage} }, ENDSTREAM;
garbage = (* - (MESSAGE | HEADER | ENDSTREAM));
message = [header], MESSAGE, { record }, ENDMESSAGE;
header = HEADER, units;
record = RECORD, units;
units = { UNIT, unit };
unit = { (* - control) | (ESCAPE, *) };
control = ENDSTREAM | HEADER | MESSAGE | ENDMESSAGE | RECORD | UNIT | ESCAPE;
The default delimiters:
HEADER = "#";
MESSAGE = ">";
ENDMESSAGE = "<";
RECORD = ? ASCII newline ?;
UNIT = ",";
ESCAPE = "\";
ENDSTREAM = "!";
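The grammar and default delimiters above can be sketched as a small encoder. This is an illustration only: the names Delimiters, is_control, and encode_message are hypothetical and are not taken from the udv crate's API. It assumes text (char) data and escapes every control codepoint that appears inside a unit.

```rust
/// Hypothetical delimiter set; field names mirror the grammar's terminals
/// and are NOT the udv crate's API.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Delimiters {
    pub header: char,
    pub message: char,
    pub end_message: char,
    pub record: char,
    pub unit: char,
    pub escape: char,
    pub end_stream: char,
}

impl Default for Delimiters {
    /// The default delimiters listed above.
    fn default() -> Self {
        Delimiters {
            header: '#',
            message: '>',
            end_message: '<',
            record: '\n',
            unit: ',',
            escape: '\\',
            end_stream: '!',
        }
    }
}

impl Delimiters {
    /// True if `c` is one of the seven control codepoints.
    pub fn is_control(&self, c: char) -> bool {
        [
            self.header,
            self.message,
            self.end_message,
            self.record,
            self.unit,
            self.escape,
            self.end_stream,
        ]
        .contains(&c)
    }

    /// Append each unit, introduced by a leading UNIT delimiter, with every
    /// control codepoint inside the unit prefixed by ESCAPE.
    fn push_units(&self, out: &mut String, units: &[&str]) {
        for unit in units {
            out.push(self.unit);
            for c in unit.chars() {
                if self.is_control(c) {
                    out.push(self.escape);
                }
                out.push(c);
            }
        }
    }

    /// Encode one message: optional header, MESSAGE, records, ENDMESSAGE.
    pub fn encode_message(&self, header: Option<&[&str]>, records: &[Vec<&str>]) -> String {
        let mut out = String::new();
        if let Some(names) = header {
            out.push(self.header);
            self.push_units(&mut out, names);
        }
        out.push(self.message);
        for record in records {
            out.push(self.record);
            self.push_units(&mut out, record);
        }
        out.push(self.end_message);
        out
    }
}
```

With the defaults, encoding no records produces the two codepoints MESSAGE and ENDMESSAGE, a single record with no units produces MESSAGE, a newline, and ENDMESSAGE, and a single record with one blank unit adds a UNIT delimiter after the newline. The three cases stay distinct in the output.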
For the most part, this is a prefix-oriented format. The two exceptions are the ENDMESSAGE and ENDSTREAM delimiters. ENDMESSAGE lets a text UDV message file be edited in a text editor without a newline inserted at the end causing problems: because the point of the format is to be unambiguous, a trailing newline would otherwise have to be treated either as part of the last unit or as introducing an empty record. ENDSTREAM means reading UDV data does not depend on knowing its length ahead of time or on the physical geometry of the buffer that holds the stream. The rule is explicit: any ENDSTREAM codepoint inside a UDV message must be escaped, so any UDV parser can parse a stream unmodified from an arbitrary location. This can also be used to, for example, embed a UDV stream at the end of a file and put the stream's offset after ENDSTREAM, so the beginning can be located from the trailing data.
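To make the prefix-oriented structure concrete, here is a minimal sketch of a parser for a single message. It is an assumption-laden illustration, not the udv crate's implementation: the Message type and parse_message function are hypothetical, the default delimiters are hard-coded, and garbage and ENDSTREAM handling are left out (an unescaped "!" is simply rejected).

```rust
/// Hypothetical parsed message; not a type from the udv crate.
#[derive(Debug, Default, PartialEq, Eq)]
pub struct Message {
    /// `None` when the message has no header at all.
    pub header: Option<Vec<String>>,
    /// Each record is a list of units; a record may hold zero units.
    pub records: Vec<Vec<String>>,
}

/// Parse one message written with the default delimiters.
/// Returns `None` on malformed input or if ENDMESSAGE is never reached.
pub fn parse_message(input: &str) -> Option<Message> {
    let mut msg = Message::default();
    let mut in_body = false; // becomes true once MESSAGE ('>') is seen
    let mut chars = input.chars();
    while let Some(c) = chars.next() {
        match c {
            // HEADER opens the header's unit list (at most once, before MESSAGE).
            '#' if !in_body && msg.header.is_none() => msg.header = Some(Vec::new()),
            // MESSAGE switches from the header to the body.
            '>' if !in_body => in_body = true,
            // RECORD opens a new, initially empty record.
            '\n' if in_body => msg.records.push(Vec::new()),
            // ENDMESSAGE closes the message.
            '<' if in_body => return Some(msg),
            _ => {
                // Anything else must belong to the unit list that is open now.
                let units = if in_body {
                    msg.records.last_mut()?
                } else {
                    msg.header.as_mut()?
                };
                match c {
                    // UNIT opens a new, initially blank unit.
                    ',' => units.push(String::new()),
                    // ESCAPE takes the next codepoint verbatim.
                    '\\' => units.last_mut()?.push(chars.next()?),
                    // Any other unescaped control codepoint is an error here.
                    '#' | '>' | '<' | '\n' | '!' => return None,
                    // Ordinary content is appended to the current unit.
                    _ => units.last_mut()?.push(c),
                }
            }
        }
    }
    None
}
```

Because a record only gains a unit when a UNIT delimiter is seen, a message consisting of MESSAGE, newline, ENDMESSAGE parses to one record with zero units, while MESSAGE, newline, ",", ENDMESSAGE parses to one record with a single blank unit.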
Example messages, using the default delimiters:
#,id,name,value>
,1,taylor,developer
,2,namewith\,comma,valuewith\
newline<
>
,1,taylor,developer
,2,namewith\,comma,valuewith\
newline<
#,id,name,value><
#,id,name,value>
<
#,id,name,,value>
,,,,<
><
>
,<
>
,
,,<
A stream concatenates messages and ends with ENDSTREAM. Because any amount of garbage data may appear before a MESSAGE, HEADER, or ENDSTREAM codepoint, the newlines that trail each message cause no problems:
#,id,name,value>
,1,taylor,developer
,2,namewith\,comma,valuewith\
newline<
>
,1,taylor,developer
,2,namewith\,comma,valuewith\
newline<
#,id,name,value><
#,id,name,value>
<
#,id,name,,value>
,,,,<
><
>
,<
>
,
,,<
!
!
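The stream rule in the grammar (garbage is skipped until a HEADER, MESSAGE, or ENDSTREAM codepoint) can be sketched as a splitter that returns each message as a slice of the input. The function name split_stream is hypothetical and not part of the udv crate. The sketch assumes the default delimiters and only tracks escapes and message boundaries; it does not validate the inside of each message.

```rust
/// Split a default-delimiter stream into its raw messages.
/// Returns `None` if ENDSTREAM is missing, a message is left open,
/// or an unescaped ENDSTREAM appears inside a message.
pub fn split_stream(input: &str) -> Option<Vec<&str>> {
    let mut messages = Vec::new();
    // Byte offset where the current message began, if one is open.
    let mut start: Option<usize> = None;
    let mut chars = input.char_indices();
    while let Some((i, c)) = chars.next() {
        match (start, c) {
            // Outside a message: ENDSTREAM terminates the stream.
            (None, '!') => return Some(messages),
            // Outside a message: HEADER or MESSAGE begins a message.
            (None, '#') | (None, '>') => start = Some(i),
            // Outside a message: everything else is garbage and is skipped.
            (None, _) => {}
            // Inside a message: ESCAPE consumes the next codepoint verbatim.
            (Some(_), '\\') => {
                chars.next()?;
            }
            // Inside a message: ENDMESSAGE closes it ('<' is one byte long).
            (Some(s), '<') => {
                messages.push(&input[s..=i]);
                start = None;
            }
            // Inside a message: ENDSTREAM must have been escaped.
            (Some(_), '!') => return None,
            // Inside a message: any other codepoint is message content.
            (Some(_), _) => {}
        }
    }
    None // input ended without ENDSTREAM
}
```

Anything after the first unescaped ENDSTREAM outside a message is never examined, which is what allows trailing data such as an offset to follow the stream.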
The C0 control-character delimiters:
HEADER = SOH;
MESSAGE = STX;
ENDMESSAGE = ETX;
RECORD = RS;
UNIT = US;
ESCAPE = ESC;
ENDSTREAM = EOT;
If a stream consists mostly of string messages, the C0 control-character delimiters above let it serialize into a compact stream with as little escaping as possible. There is also a C0-utf8 mode that does this while keeping the stream valid UTF-8.
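As a rough illustration of why the C0 assignments need so little escaping, the sketch below writes them as Rust constants using the standard ASCII code points for SOH, STX, ETX, EOT, ESC, RS, and US. The constant names and the needs_escaping helper are hypothetical and not taken from the udv crate.

```rust
// ASCII C0 control characters used as delimiters (names are illustrative).
pub const HEADER: char = '\x01'; // SOH, start of heading
pub const MESSAGE: char = '\x02'; // STX, start of text
pub const END_MESSAGE: char = '\x03'; // ETX, end of text
pub const END_STREAM: char = '\x04'; // EOT, end of transmission
pub const ESCAPE: char = '\x1b'; // ESC, escape
pub const RECORD: char = '\x1e'; // RS, record separator
pub const UNIT: char = '\x1f'; // US, unit separator

/// All seven control codepoints of the C0 delimiter set.
pub const CONTROLS: [char; 7] = [
    HEADER,
    MESSAGE,
    END_MESSAGE,
    END_STREAM,
    ESCAPE,
    RECORD,
    UNIT,
];

/// True if `text` contains any codepoint that would have to be escaped
/// when written as a unit under the C0 delimiter set.
pub fn needs_escaping(text: &str) -> bool {
    text.chars().any(|c| CONTROLS.contains(&c))
}
```

Ordinary text containing commas, newlines, "#", "<", ">", or "!" needs no escaping under this set, since none of those are delimiters here; only text that itself contains one of the seven control characters does.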