Bencodex: Bencoding Extended
============================

*The version of this document is **1.3**.  See also [changelog].*

*A list of implementations is available in the
[LIBRARIES.tsv](./LIBRARIES.tsv) file.*

Bencodex is a serialization format that extends BitTorrent's [Bencoding].
Since it is a superset of Bencoding, every valid Bencoding representation is
a valid Bencodex representation of the same meaning (i.e., represents the same
value).  Bencodex adds the following data types to Bencoding:

 -  null
 -  Boolean values
 -  Unicode strings besides byte strings
 -  Dictionaries with both byte and Unicode string keys

[Bencoding]: http://www.bittorrent.org/beps/bep_0003.html#bencoding
[changelog]: ./CHANGES.md


Why not *[insert your favorite format here]*
--------------------------------------------

The unique feature of Bencoding is forced normalization.
According to Wikipedia's [Bencode] page:

> For each possible (complex) value, there is only a single valid bencoding;
> i.e. there is a [bijection] between values and their encodings.
> This has the advantage that applications may compare bencoded values by
> comparing their encoded forms, eliminating the need to decode the values.

This makes it simple for an application to determine whether two encoded
values are the same, in particular when cryptographic hashes or digital
signatures are involved.

There have been countless improvements in data serialization, such as
rich data types, human readability, compact binary representation,
zero-copy serialization, and even streaming, but canonical representation
still receives little attention.

Bencodex is not ambitious; it merely aims to combine Bencoding's strengths
with the data types commonly found in modern serialization formats.

[Bencode]: https://en.wikipedia.org/wiki/Bencode#Features_&_drawbacks
[bijection]: https://en.wikipedia.org/wiki/Bijection


Encoding
--------

Note that the semantics (i.e., the values that encodings represent) are
written below as Python literals.

 -  Null is represented by `n` (`6e`).

 -  Boolean true is represented by `t` (`74`),
    and false is represented by `f` (`66`).

 -  Byte strings are represented by their length in base 10, followed by
    a colon and the bytes themselves.

    For example, `4:spam` (`34 3a 73 70 61 6d`) corresponds to `b"spam"`.

 -  Unicode strings are represented by `u` followed by the byte length of
    their UTF-8 encoding in base 10, a colon, and the UTF-8 encoding itself.

    For example, `u6:단팥` (`75 36 3a eb 8b a8 ed 8c a5`) corresponds to
    `u"\ub2e8\ud325"`.

 -  Integers are represented by an `i` followed by the number in base 10
    followed by an `e`.

    For example, `i3e` (`69 33 65`) corresponds to `3`,
    and `i-3e` (`69 2d 33 65`) corresponds to `-3`.

    Integers have no size limitation.

    `i-0e` (`69 2d 30 65`) is invalid.  All encodings with a leading zero,
    such as `i03e` (`69 30 33 65`), are invalid, other than `i0e` (`69 30 65`),
    which of course corresponds to `0`.

 -  Lists are encoded as an `l` followed by their elements (also represented in
    Bencodex) followed by an `e`.

    For example, `l4:spamu4:eggse` (`6c 34 3a 73 70 61 6d 75 34 3a 65 67 67 73
    65`) corresponds to `[b"spam", u"eggs"]`.

 -  Dictionaries are encoded as a `d` followed by a list of alternating keys
    and their corresponding values followed by an `e`.

    For example, `d3:cowu3:moou4:spam4:eggse` (`64 33 3a 63 6f 77 75 33 3a 6d
    6f 6f 75 34 3a 73 70 61 6d 34 3a 65 67 67 73 65`) corresponds to
    `{b"cow": u"moo", u"spam": b"eggs"}`, and `du4:spaml1:au1:bee` (`64 75 34
    3a 73 70 61 6d 6c 31 3a 61 75 31 3a 62 65 65`) corresponds to
    `{u"spam": [b"a", u"b"]}`.

    Keys must be Unicode or byte strings, and must appear in the following
    order:

     -  All byte string keys appear before any Unicode string key.

     -  Byte string keys are sorted by their raw byte values,
        not alphanumerically.

     -  Unicode string keys are sorted by their UTF-8 *byte* representations,
        *not* by any collation order or chart order defined by Unicode.

        For example, `b` (`62`) comes before `á` (`c3 a1`),
        because the byte `62` is less than the byte `c3`.

    `du1:k1:v1:k1:ve` (`64 75 31 3a 6b 31 3a 76 31 3a 6b 31 3a 76 65`) is
    invalid because the Unicode key `u1:k` appears earlier than the byte
    string key `1:k`.
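
The rules above can be sketched as a small Python encoder.  This is a minimal
sketch, not a reference implementation: the function name `encode` and
the mapping from Python types (`None`, `bool`, `int`, `bytes`, `str`, `list`,
`dict`) to Bencodex types are assumptions made for illustration.

```python
def encode(value) -> bytes:
    """Encode a Python value as Bencodex bytes (illustrative sketch)."""
    if value is None:
        return b"n"
    # bool must be tested before int, because bool is a subclass of int.
    if isinstance(value, bool):
        return b"t" if value else b"f"
    if isinstance(value, int):
        # Python's decimal formatting never emits "-0" or leading zeros.
        return b"i%de" % value
    if isinstance(value, bytes):
        return b"%d:%b" % (len(value), value)
    if isinstance(value, str):
        raw = value.encode("utf-8")
        # The length prefix counts UTF-8 bytes, not code points.
        return b"u%d:%b" % (len(raw), raw)
    if isinstance(value, (list, tuple)):
        return b"l" + b"".join(encode(item) for item in value) + b"e"
    if isinstance(value, dict):
        def sort_key(key):
            # Byte string keys first, then Unicode keys; each group is
            # ordered by raw bytes (UTF-8 bytes for Unicode keys).
            if isinstance(key, bytes):
                return (0, key)
            if isinstance(key, str):
                return (1, key.encode("utf-8"))
            raise TypeError("keys must be byte or Unicode strings")

        body = b"".join(
            encode(key) + encode(value[key])
            for key in sorted(value, key=sort_key)
        )
        return b"d" + body + b"e"
    raise TypeError(f"unsupported type: {type(value).__name__}")


# Insertion order does not matter; the encoder normalizes the key order.
assert encode({"spam": b"eggs", b"cow": "moo"}) == b"d3:cowu3:moou4:spam4:eggse"
```

Because the key order is fixed by the format, two dictionaries with the same
entries always produce identical bytes, whatever order they were built in.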


Test suite
----------

The *testsuite/* directory contains a set of Bencodex tests.  Every test case
is a triple of files: a *.dat* file containing arbitrary Bencodex data,
a *.yaml* file containing the corresponding value in YAML, and a *.json* file
which is an alternative to the YAML and renders an AST of the Bencodex value.

For example, *list.dat* contains the following Bencodex data:

~~~~ bencodex
lu16:a Unicode string13:a byte stringi123ei-456etfndu1:au4:dictelu1:au4:listee
~~~~

which encodes the value corresponding to *list.yaml*, that is:

~~~~ yaml
- a Unicode string
- !!binary "YSBieXRlIHN0cmluZw=="  # b"a byte string"
- 123
- -456
- true
- false
- null
- a: dict
- [a, list]
~~~~

Alternatively, there is *list.json*, which renders an AST of the value
structure:

~~~~ json
{
  "type": "list",
  "values": [
    {
      "type": "text",
      "value": "a Unicode string"
    },
    {
      "base64": "YSBieXRlIHN0cmluZw==",
      "type": "binary"
    },
    {
      "decimal": "123",
      "type": "integer"
    },
    {
      "decimal": "-456",
      "type": "integer"
    },
    {
      "type": "boolean",
      "value": true
    },
    {
      "type": "boolean",
      "value": false
    },
    {
      "type": "null"
    },
    {
      "pairs": [
        {
          "key": {
            "type": "text",
            "value": "a"
          },
          "value": {
            "type": "text",
            "value": "dict"
          }
        }
      ],
      "type": "dictionary"
    },
    {
      "type": "list",
      "values": [
        {
          "type": "text",
          "value": "a"
        },
        {
          "type": "text",
          "value": "list"
        }
      ]
    }
  ]
}
~~~~

Note that the schema of the *.json* files is formally described in
[JSON Schema]; see [utils/testsuite-schema.json](./utils/testsuite-schema.json).
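
As an illustration, the AST shown in *list.json* can be turned back into plain
Python values with a short recursive function.  This is only a sketch: the name
`from_ast` is hypothetical, and it handles just the node shapes that appear in
the example above; the schema file may define fields this sketch does not know
about.

```python
import base64


def from_ast(node):
    """Convert a test suite JSON AST node into a Python value (sketch)."""
    kind = node["type"]
    if kind == "null":
        return None
    if kind == "boolean":
        return node["value"]
    if kind == "integer":
        # Integers are stored as decimal strings, so no precision is lost.
        return int(node["decimal"])
    if kind == "binary":
        return base64.b64decode(node["base64"])
    if kind == "text":
        return node["value"]
    if kind == "list":
        return [from_ast(item) for item in node["values"]]
    if kind == "dictionary":
        return {
            from_ast(pair["key"]): from_ast(pair["value"])
            for pair in node["pairs"]
        }
    raise ValueError(f"unknown node type: {kind!r}")


# The binary node from list.json decodes to the byte string b"a byte string".
node = {"base64": "YSBieXRlIHN0cmluZw==", "type": "binary"}
assert from_ast(node) == b"a byte string"
```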

An implementation should satisfy the following rules:

 -  The bytes that an encoder builds from the content of a *.yaml*/*.json*
    file should be exactly the same as the contents of the corresponding
    *.dat* file.

 -  The value that a decoder reads from a *.dat* file should be equivalent to
    the content of the corresponding *.yaml*/*.json* file.
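
To illustrate the decoder rule, here is a minimal decoder sketch applied to
the *list.dat* bytes shown earlier.  The names `decode`, `_decode_at`, and
`_read_length` are hypothetical.  Rejecting repeated dictionary keys is
an assumption of this sketch, and it does not check length prefixes for leading
zeros, since this document does not address them.

```python
def _read_length(data, pos):
    """Parse '<digits>:' at pos; return (length, offset after the colon)."""
    colon = data.index(b":", pos)
    digits = data[pos:colon]
    if not digits.isdigit():
        raise ValueError("invalid length prefix")
    return int(digits), colon + 1


def _decode_at(data, pos):
    """Decode one value starting at pos; return (value, next offset)."""
    head = data[pos:pos + 1]
    if head in (b"n", b"t", b"f"):
        return {b"n": None, b"t": True, b"f": False}[head], pos + 1
    if head == b"i":
        end = data.index(b"e", pos)
        digits = data[pos + 1:end]
        body = digits[1:] if digits[:1] == b"-" else digits
        # Reject "-0" and any leading zero; only "0" itself may start with 0.
        if not body.isdigit() or (body[:1] == b"0" and digits != b"0"):
            raise ValueError("invalid integer")
        return int(digits), end + 1
    if head == b"u" or head.isdigit():
        is_text = head == b"u"
        length, start = _read_length(data, pos + 1 if is_text else pos)
        raw = data[start:start + length]
        if len(raw) != length:
            raise ValueError("string is shorter than its length prefix")
        return (raw.decode("utf-8") if is_text else raw), start + length
    if head == b"l":
        items, pos = [], pos + 1
        while data[pos:pos + 1] != b"e":
            item, pos = _decode_at(data, pos)
            items.append(item)
        return items, pos + 1
    if head == b"d":
        result, previous, pos = {}, None, pos + 1
        while data[pos:pos + 1] != b"e":
            key, pos = _decode_at(data, pos)
            if not isinstance(key, (bytes, str)):
                raise ValueError("keys must be byte or Unicode strings")
            # Byte string keys rank before Unicode keys; ties break on bytes.
            is_text = isinstance(key, str)
            rank = (1, key.encode("utf-8")) if is_text else (0, key)
            if previous is not None and rank <= previous:
                raise ValueError("keys are out of order or repeated")
            previous = rank
            result[key], pos = _decode_at(data, pos)
        return result, pos + 1
    raise ValueError(f"unexpected data at offset {pos}")


def decode(data: bytes):
    """Decode a complete Bencodex document (illustrative sketch)."""
    value, end = _decode_at(data, 0)
    if end != len(data):
        raise ValueError("trailing bytes after the top-level value")
    return value


LIST_DAT = (b"lu16:a Unicode string13:a byte stringi123ei-456etfn"
            b"du1:au4:dictelu1:au4:listee")
LIST_VALUE = ["a Unicode string", b"a byte string", 123, -456,
              True, False, None, {"a": "dict"}, ["a", "list"]]
assert decode(LIST_DAT) == LIST_VALUE
```

A real test runner would load each *.dat* file and its *.yaml*/*.json*
counterpart from *testsuite/* rather than hard-coding them as above.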

[JSON Schema]: https://json-schema.org/


----

This document (*README.md*) and all other content in this repository, including
the test suite (*testsuite/*), are in the public domain.