===========
BSON Corpus
===========

:Title: BSON Corpus
:Author: David Golden
:Lead: Jeff Yemin
:Advisors: Craig Wilson
:Status: Approved
:Type: Standards
:Minimum Server Version: N/A
:Last Modified: January 23, 2017
:Version: 1.3

.. contents::

Abstract
========

The official BSON specification does not include test data, so this
pseudo-specification describes tests for BSON encoding and decoding.  Since
MongoDB's "Extended JSON" (hereafter ``extjson``) format is used for
human-readable interchange of BSON documents, we include tests for encoding
and decoding it as well, subject to certain limitations.

Meta
====

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD",
"SHOULD NOT", "RECOMMENDED",  "MAY", and "OPTIONAL" in this document are to be
interpreted as described in `RFC 2119`_.

.. _RFC 2119: https://www.ietf.org/rfc/rfc2119.txt

Motivation for Change
=====================

To ensure correct operation, we want drivers to implement identical tests
for important features.  BSON (and ``extjson``) are critical for correct
operation and data exchange, but historically had no common test corpus.
This pseudo-specification provides such tests.

Goals
-----

* Provide machine-readable test data files for BSON and ``extjson`` encoding
  and decoding.

* Cover all current and historical BSON types.

* Define test data patterns for three cases: (a) roundtrip, (b) decode
  errors, and (c) parse errors.

Non-Goals
---------

* Replace or extend the official BSON spec at http://bsonspec.org.

* Provide a formal specification for ``extjson``.

Specification
=============

The specification for BSON lives at http://bsonspec.org.  The ``extjson``
format has no specification, but is (partially) documented at
https://docs.mongodb.com/manual/reference/mongodb-extended-json/.

(Note that the generated and documented ``extjson`` formats have changed
slightly from one server version to the next.)

Test Plan
=========

This test plan describes a general approach for BSON testing.  Future BSON
specifications (such as for new types like Decimal128) may specialize or
alter the approach described below.

Description of the BSON Corpus
------------------------------

This BSON test data corpus consists of a JSON file for each BSON type, plus
a ``top.json`` file for testing the overall, enclosing document.

Top-level keys include:

* ``description``: human-readable description of what is in the file

* ``bson_type``: hex string of the first byte of a BSON element (e.g. "0x01"
  for type "double"); this will be the synthetic value "0x00" for ``top.json``.

* ``test_key``: name of a field in a ``valid`` test case ``extjson`` document
  that should be checked against the case's ``string`` field.

* ``valid`` (optional): an array of validity test cases (see below).

* ``decodeErrors`` (optional): an array of decode error cases (see below).

* ``parseErrors`` (optional): an array of type-specific parse error cases (see
  below).

* ``deprecated`` (optional): this field will be present (and true) if the
  BSON type has been deprecated (i.e. Symbol, Undefined and DBPointer).

Validity test case keys include:

* ``description``: human-readable test case label.

* ``bson``: an (uppercase) big-endian hex representation of a BSON byte
  string.  Normalize the letter case as appropriate before comparing hex
  strings in any roundtrip tests.

* ``extjson``: a document representing the decoded extended JSON document
  equivalent to the subject.

* ``canonical_bson`` (optional): like ``bson``, but is the hex string
  representation of expected BSON encoder output iff ``bson`` would not be
  generated by a correct encoder (e.g. bad array keys).

* ``canonical_extjson`` (optional): like ``extjson`` but is the extended JSON
  encoder output iff ``extjson`` would not be generated by a correct encoder
  (e.g. the "datetime" type that has more than one extended JSON
  representation).

* ``lossy`` (optional): boolean; present (and true) iff ``canonical_bson``
  (or ``bson`` if there is no ``canonical_bson``) can't be represented exactly
  with extended JSON (e.g. NaN with a payload).

Decode error cases provide an invalid BSON document or field that
should result in an error. For each case, keys include:

* ``description``: human-readable test case label.

* ``bson``: an (uppercase) big-endian hex representation of an invalid
  BSON string that should fail to decode correctly.

Parse error cases are type-specific and represent some input that cannot
be encoded to the ``bson_type`` under test.  For each case, keys include:

* ``description``: human-readable test case label.

* ``string``: a text or numeric representation of an input that can't be
  parsed to a valid value of the given type.
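
As an illustration of the layout described above, the following Python sketch
parses a small corpus-style document and checks that the expected keys are
present.  The sample values, the helper name ``check_corpus_file``, and the
choice to treat every key not marked "(optional)" as required are assumptions
made for this example; they are not taken from a real corpus file.

```python
import json

# Keys listed above without "(optional)" are treated as required here;
# that reading is an assumption of this sketch.
REQUIRED_TOP_KEYS = {"description", "bson_type", "test_key"}
OPTIONAL_TOP_KEYS = {"valid", "decodeErrors", "parseErrors", "deprecated"}

# Keys checked for each kind of test case array.
CASE_KEYS = {
    "valid": ("description", "bson"),
    "decodeErrors": ("description", "bson"),
    "parseErrors": ("description", "string"),
}


def check_corpus_file(doc):
    """Return a list of problems found in one parsed corpus file."""
    problems = []
    present = set(doc)
    for key in sorted(REQUIRED_TOP_KEYS - present):
        problems.append("missing top-level key: " + key)
    for key in sorted(present - REQUIRED_TOP_KEYS - OPTIONAL_TOP_KEYS):
        problems.append("unexpected top-level key: " + key)
    for array_name, required in CASE_KEYS.items():
        for index, case in enumerate(doc.get(array_name, [])):
            for key in required:
                if key not in case:
                    problems.append("%s[%d] is missing %s" % (array_name, index, key))
    return problems


# Illustrative sample only; real corpus files are larger.
SAMPLE = json.loads(r"""
{
  "description": "Int32 (illustrative sample)",
  "bson_type": "0x10",
  "test_key": "a",
  "valid": [
    {"description": "one",
     "bson": "0C0000001061000100000000",
     "extjson": "{\"a\" : 1}"}
  ],
  "decodeErrors": [
    {"description": "truncated document", "bson": "0C00000010610001"}
  ]
}
""")

print(check_corpus_file(SAMPLE))
```

A test harness could run a check like this once per file before iterating over
the individual cases, so that a malformed corpus file is reported separately
from a codec failure.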

Extended JSON encoding, escaping and ordering
---------------------------------------------

Because the ``extjson`` and ``canonical_extjson`` fields are embedded in a
JSON document, all their JSON metacharacters are escaped.  Control
characters and non-ASCII codepoints are represented with ``\uXXXX``.  Note
that this means that the corpus JSON will appear to have double-escaped
characters ``\\uXXXX``.  This is by design to ensure that the ``extjson``
field remains printable ASCII without embedded null characters to ensure
maximum portability to different language JSON or extended JSON decoders.

The JSON format is *unordered* and whitespace (outside of strings) is not
significant.  Implementations using these tests are responsible for
normalizing JSON however necessary for effective comparison.
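
One possible normalization, sketched in Python with only the standard ``json``
module, is to parse the text and re-serialize it with sorted keys and fixed
separators.  The helper name ``normalize_json`` and the sample strings are
assumptions made for this example.  The sketch also shows the two-stage decode
that the double-escaping implies.

```python
import json


def normalize_json(text):
    """Re-serialize JSON text with sorted keys, no extra whitespace, ASCII only."""
    return json.dumps(json.loads(text), sort_keys=True, separators=(",", ":"))


# A corpus-style entry: the extjson value is itself JSON text stored inside a
# JSON string, so it has to be decoded twice.
corpus_text = r'{"extjson": "{\"a\" : \"\\u00e9\"}"}'
outer = json.loads(corpus_text)        # first decode yields the extjson text
inner = json.loads(outer["extjson"])   # second decode yields the document

# Key order, whitespace and escape style no longer affect the comparison.
same = normalize_json('{"b": 1, "a": "\\u00e9"}') == normalize_json('{ "a" : "\u00e9", "b" : 1 }')
print(same)
```

Sorting keys discards the original key order and collapses duplicate keys, so
this approach only suits comparisons where those differences are meant to be
ignored.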

Language-specific differences
-----------------------------

Some programming languages may not be able to represent or transmit all
types accurately.  In such cases, implementations SHOULD ignore (or modify)
any tests which are not supported on that platform.

Testing validity
----------------

To test validity of a case in the ``valid`` array, we consider up to four
possible "input" representations: BSON, "canonical" BSON,
extended JSON, and "canonical" extended JSON.  (Not all will exist for a
given case.)

For any input, we wish to see if it can be correctly decoded, then re-encoded
to "canonical" BSON and extended JSON representations.  This means there
are up to eight assertions (four input types; two output types).

In some cases, there may be fewer than four inputs, or some conversions may
not be valid, resulting in fewer than eight assertions.

The following pseudo-code describes which assertions drivers SHOULD
test for a given case::

    B  = decode_hex( case["bson"] )
    E  = case.get("extjson")    # may be absent; only used if present

    if "canonical_bson" in case:
        cB = decode_hex( case["canonical_bson"] )
    else:
        cB = B

    if "canonical_extjson" in case:
        cE = case["canonical_extjson"]
    else:
        cE = E

    assert encode_bson(decode_bson(B)) == cB                    # B->cB

    if B != cB:
        assert encode_bson(decode_bson(cB)) == cB               # cB->cB

    if "extjson" in case:
        assert encode_extjson(decode_bson(B)) == cE             # B->cE
        assert encode_extjson(decode_extjson(E)) == cE          # E->cE

        if B != cB:
            assert encode_extjson(decode_bson(cB)) == cE        # cB->cE

        if E != cE:
            assert encode_extjson(decode_extjson(cE)) == cE     # cE->cE

        if "lossy" not in case:
            assert encode_bson(decode_extjson(E)) == cB         # E->cB

            if E != cE:
                assert encode_bson(decode_extjson(cE)) == cB    # cE->cB

Implementations MAY test assertions in an implementation-specific
manner.
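
The pseudo-code above could be made executable along the following lines.
This is a sketch under stated assumptions: the ``codec`` object with its four
functions, the optional ``normalize`` hook, and the toy codec used to
exercise it are all hypothetical and stand in for a driver's own BSON and
extended JSON routines.

```python
from types import SimpleNamespace


def run_valid_case(case, codec, normalize=lambda text: text):
    """Run the validity assertions for one case; return the labels checked."""
    B = bytes.fromhex(case["bson"])
    cB = bytes.fromhex(case["canonical_bson"]) if "canonical_bson" in case else B
    checked = []

    def check(label, actual, expected):
        assert actual == expected, "%s failed: %s" % (label, case["description"])
        checked.append(label)

    def to_extjson(value):
        return normalize(codec.encode_extjson(value))

    check("B->cB", codec.encode_bson(codec.decode_bson(B)), cB)
    if B != cB:
        check("cB->cB", codec.encode_bson(codec.decode_bson(cB)), cB)

    if "extjson" in case:
        E = case["extjson"]
        cE = case.get("canonical_extjson", E)
        want = normalize(cE)
        check("B->cE", to_extjson(codec.decode_bson(B)), want)
        check("E->cE", to_extjson(codec.decode_extjson(E)), want)
        if B != cB:
            check("cB->cE", to_extjson(codec.decode_bson(cB)), want)
        if E != cE:
            check("cE->cE", to_extjson(codec.decode_extjson(cE)), want)
        if "lossy" not in case:
            check("E->cB", codec.encode_bson(codec.decode_extjson(E)), cB)
            if E != cE:
                check("cE->cB", codec.encode_bson(codec.decode_extjson(cE)), cB)
    return checked


# Stand-in codec so the sketch runs without a driver: the "decoded value" is
# just the raw bytes and the "extended JSON" is their hex text.  Not real BSON.
toy_codec = SimpleNamespace(
    decode_bson=lambda data: data,
    encode_bson=lambda value: value,
    decode_extjson=lambda text: bytes.fromhex(text),
    encode_extjson=lambda value: value.hex().upper(),
)

toy_case = {"description": "toy", "bson": "0500000000", "extjson": "0500000000"}
print(run_valid_case(toy_case, toy_codec))
```

Returning the list of labels makes it easy to confirm how many of the up to
eight assertions actually ran for a given case.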

Testing decode errors
---------------------

The ``decodeErrors`` cases represent BSON documents that are sufficiently
incorrect that they can't be parsed even with liberal interpretation of
the BSON schema (e.g. reading arrays with invalid keys is possible, even
though technically invalid, so they are *not* ``decodeErrors``).

Drivers SHOULD test that each case results in a decoding error.
Implementations MAY test assertions in an implementation-specific
manner.
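
A decode error test can be as simple as confirming that the decoder raises
its error type for each case.  The Python sketch below assumes a hypothetical
``BSONDecodeError`` exception and uses a deliberately minimal stand-in decoder,
``check_framing``, that inspects only the int32 length prefix and the trailing
NUL byte of a document; a real driver would substitute its own decoder and
exception type.

```python
import struct


class BSONDecodeError(ValueError):
    """Hypothetical error type standing in for a driver's decode exception."""


def check_framing(data):
    """Check only the outer framing: int32 length prefix and trailing NUL."""
    if len(data) < 5:
        raise BSONDecodeError("document shorter than 5 bytes")
    (declared,) = struct.unpack_from("<i", data, 0)
    if declared != len(data):
        raise BSONDecodeError("declared length %d, actual %d" % (declared, len(data)))
    if data[-1] != 0:
        raise BSONDecodeError("missing trailing NUL byte")
    return data


def fails_to_decode(case, decode, error_type):
    """Return True if decoding the case's bytes raises error_type."""
    try:
        decode(bytes.fromhex(case["bson"]))
    except error_type:
        return True
    return False


# Illustrative cases, not copied from the corpus.
truncated = {"description": "truncated document", "bson": "0C00000010610001"}
well_formed = {"description": "well-formed", "bson": "0C0000001061000100000000"}
print(fails_to_decode(truncated, check_framing, BSONDecodeError))
print(fails_to_decode(well_formed, check_framing, BSONDecodeError))
```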

Testing parsing errors
----------------------

The interpretation of ``parseErrors`` is type-specific.  For example,
helpers for creating Decimal128 values may parse strings to convert them
to binary Decimal128 values.  The ``parseErrors`` cases are strings that
will *not* convert correctly.

The documentation for a type (if any) will specify how to use these
cases for testing.

Drivers SHOULD test that each case results in a parse error.
Implementations MAY test assertions in an implementation-specific
manner.

Deprecated types
----------------

The corpus files for deprecated types are provided for informational
purposes.  Implementations MAY ignore or modify them to match legacy
treatment of deprecated types.

Implementation Notes
====================

A tool for visualizing BSON
---------------------------

The test directory includes a Perl script ``bsonview``, which will
decompose and highlight elements of a BSON document.  It may be used like
this::

    echo "0900000010610005000000" | perl bsonview -x

Notes for certain types
-----------------------

Array
~~~~~

Arrays can have non-canonical BSON if the array indexes are not set as
"0", "1", etc.

Boolean
~~~~~~~

The only valid values are 0 and 1.  Other non-zero numbers MUST be
interpreted as errors rather than "true" values.
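
A strict decoder for the boolean payload byte might look like the following
Python sketch.  The function name ``decode_boolean`` and the use of
``ValueError`` are assumptions made for this example.

```python
def decode_boolean(byte_value):
    """Map a BSON boolean payload byte to a bool, rejecting anything but 0 and 1."""
    if byte_value == 0:
        return False
    if byte_value == 1:
        return True
    raise ValueError("invalid boolean byte: 0x%02X" % byte_value)


print(decode_boolean(0), decode_boolean(1))
```

Note that a truthiness check such as ``bool(byte_value)`` would accept 2 or
255 as true, which is exactly the behavior this section rules out.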

Binary
~~~~~~

The Base64 encoded text in the extended JSON representation MUST be padded.
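
For example, two bytes encode to three Base64 characters plus one ``=`` of
padding.  The Python sketch below uses the standard ``base64`` module; the
helper name ``is_padded_base64`` is an assumption made for this example.

```python
import base64


def is_padded_base64(text):
    """True if text is valid Base64 whose length is a multiple of four."""
    if len(text) % 4 != 0:
        return False
    try:
        base64.b64decode(text, validate=True)
    except ValueError:  # binascii.Error is a subclass of ValueError
        return False
    return True


encoded = base64.b64encode(b"\xff\xff").decode("ascii")
print(encoded, is_padded_base64(encoded), is_padded_base64(encoded.rstrip("=")))
```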

Code
~~~~

There are multiple ways to encode Unicode characters in a JSON document.
Individual implementers may need to normalize provided and generated
extended JSON before comparison.

DateTime
~~~~~~~~

The "canonical" extended JSON format is $numberLong, as this allows *exact*
representation of the underlying BSON binary data without requiring
parsing or rendering and without using system libraries for conversion.
The ISO-8601 UTC ("Zulu") format with millisecond precision is an allowed
alternate input.

This differs from mongoexport behavior (which itself has changed over the
years) by design, to ensure the more robust representation.  Implementations
MAY output ISO-8601 by default if necessary for legacy compatibility
reasons; such implementations should swap ``extjson`` and
``canonical_extjson`` when testing validity cases.
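
The difference between the two representations can be sketched in Python as
follows.  The helper names are assumptions made for this example, and the
``$date`` wrapper shape is taken from the Extended JSON Specification rather
than from this document.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def canonical_datetime(millis):
    """Wrap milliseconds since the epoch without any date arithmetic."""
    return {"$date": {"$numberLong": str(millis)}}


def millis_to_iso8601(millis):
    """Render milliseconds since the epoch as ISO-8601 UTC with milliseconds."""
    moment = EPOCH + timedelta(milliseconds=millis)
    return moment.strftime("%Y-%m-%dT%H:%M:%S") + ".%03dZ" % (moment.microsecond // 1000)


def iso8601_to_millis(text):
    """Parse the ISO-8601 form produced above back to milliseconds."""
    moment = datetime.strptime(text, "%Y-%m-%dT%H:%M:%S.%fZ").replace(tzinfo=timezone.utc)
    delta = moment - EPOCH
    return (delta.days * 86400 + delta.seconds) * 1000 + delta.microseconds // 1000


print(canonical_datetime(1356351330501))
print(millis_to_iso8601(1356351330501))
```

The canonical form is plain string formatting of the stored integer, whereas
the ISO-8601 form depends on a date library whose supported range may be
narrower than a signed 64-bit millisecond count.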

Decimal
~~~~~~~

NaN with payload can't be represented in extended JSON, so such conversions are
lossy.

Double
~~~~~~

There is not yet a way to represent Inf, -Inf or NaN in extended JSON.  Even if
a $numberDouble is added, it is unlikely to support special values with
payloads, so such doubles would be lossy when converted to extended JSON.
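
To illustrate why payloads are lossy, the Python sketch below decodes two
different 8-byte patterns that are both NaN.  The byte strings are examples
chosen for this sketch, not corpus data.

```python
import math
import struct

# Little-endian IEEE 754 doubles: a quiet NaN, and the same with a payload bit.
QUIET_NAN = bytes.fromhex("000000000000F87F")    # 0x7FF8000000000000
PAYLOAD_NAN = bytes.fromhex("010000000000F87F")  # 0x7FF8000000000001


def decode_double(raw):
    """Decode an 8-byte little-endian BSON double payload."""
    return struct.unpack("<d", raw)[0]


for raw in (QUIET_NAN, PAYLOAD_NAN):
    value = decode_double(raw)
    print(raw.hex().upper(), math.isnan(value), repr(value))
```

Both values print as ``nan``, so a text form that only says "NaN" cannot
recover which of the two byte patterns was the original.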

String
~~~~~~

There are multiple ways to encode Unicode characters in a JSON document.
Individual implementers may need to normalize provided and generated
extended JSON before comparison.

DBPointer
~~~~~~~~~

This type is deprecated and there is no DBPointer representation in
extended JSON.  mongoexport converts these to DBRef documents, but such
conversion is outside the scope of this spec.

Symbol
~~~~~~

This type is deprecated and there is no Symbol representation in extended JSON.
mongoexport converts these to strings, but such conversion is outside the
scope of this spec.

Undefined
~~~~~~~~~

This type is deprecated, but there is a "$undefined" representation in
extended JSON.


Reference Implementation
========================

The Java, C# and Perl drivers serve as reference implementations.

Design Rationale
================

Use of extjson
--------------

Testing conversion requires an "input" and an "output".  With a BSON string
as both input and output, we can only test that it roundtrips correctly --
we can't test that the decoded value visible to the language is correct.

For example, a pathological encoder/decoder could invert Boolean true and
false during decoding and encoding.  The BSON would roundtrip but the
program would see the wrong values.
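
The pathological codec described above can be sketched in a few lines of
Python.  The function names are hypothetical and exist only to make the
argument concrete.

```python
def bad_decode_bool(byte_value):
    """Deliberately wrong: treats 0x00 as true and 0x01 as false."""
    return byte_value == 0


def bad_encode_bool(value):
    """Deliberately wrong in the matching way, so the two errors cancel."""
    return 0 if value else 1


original = 1                              # BSON byte for "true"
seen_by_program = bad_decode_bool(original)
roundtripped = bad_encode_bool(seen_by_program)
print(roundtripped == original, seen_by_program)
```

The bytes survive the roundtrip unchanged even though the program observes
``False`` for a stored "true", which a BSON-only comparison cannot detect.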

Therefore, we need a separate, semantic description of the contents of a BSON
string in a machine readable format.  Fortunately, we already have extjson as a
means of doing so.  The extended JSON strings contained within the tests adhere
to the Extended JSON Specification.

Repetition across cases
-----------------------

Some validity cases may result in duplicate assertions across cases,
particularly if the ``bson`` field is different in different cases, but the
``canonical_bson`` field is the same.  This is by design so that each case
stands alone and can be confirmed to be internally consistent via the
assertions.  This makes for easier and safer test case development.

Changes
=======

Version 1.3 - January 23, 2017

* Added ``multi-type.json`` to test encoding and decoding all BSON types within
  the same document.

* Amended all extended JSON strings to adhere to the Extended JSON
  Specification.

* Modified the "Use of extjson" section of this specification to note that
  canonical extended JSON is now used.

Version 1.2 - November 14, 2016

* Removed "invalid flags" BSON Regexp case.

Version 1.1 - October 25, 2016

* Added a "non-alphabetized flags" case to the BSON Regexp corpus file;
  decoders must be able to read non-alphabetized flags, but encoders must
  emit alphabetized flags.

* Added an "invalid flags" case to the BSON Regexp corpus file.