===========
BSON Corpus
===========

:Title: BSON Corpus
:Author: David Golden
:Lead: Jeff Yemin
:Advisors: Craig Wilson
:Status: Approved
:Type: Standards
:Minimum Server Version: N/A
:Last Modified: January 23, 2017
:Version: 1.3

.. contents::

Abstract
========

The official BSON specification does not include test data, so this
pseudo-specification describes tests for BSON encoding and decoding.  Since
MongoDB's "Extended JSON" (hereafter ``extjson``) format is used for
human-readable interchange of BSON documents, we include tests for encoding
and decoding it as well, subject to certain limitations.

Meta
====

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD",
"SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be
interpreted as described in `RFC 2119`_.

.. _RFC 2119: https://www.ietf.org/rfc/rfc2119.txt

Motivation for Change
=====================

To ensure correct operation, we want drivers to implement identical tests for
important features.  BSON (and ``extjson``) are critical for correct operation
and data exchange, but historically had no common test corpus.  This
pseudo-specification provides such tests.

Goals
-----

* Provide machine-readable test data files for BSON and ``extjson`` encoding
  and decoding.
* Cover all current and historical BSON types.
* Define test data patterns for three cases: (a) roundtrip, (b) decode errors,
  and (c) parse errors.

Non-Goals
---------

* Replace or extend the official BSON spec at http://bsonspec.org.
* Provide a formal specification for ``extjson``.

Specification
=============

The specification for BSON lives at http://bsonspec.org.  The ``extjson``
format has no specification, but is (partially) documented at
https://docs.mongodb.com/manual/reference/mongodb-extended-json/.  (Note that
the ``extjson`` formats generated and documented have changed slightly from
server version to server version.)

Test Plan
=========

This test plan describes a general approach for BSON testing.
Future BSON specifications (such as for new types like Decimal128) may
specialize or alter the approach described below.

Description of the BSON Corpus
------------------------------

This BSON test data corpus consists of a JSON file for each BSON type, plus a
``top.json`` file for testing the overall, enclosing document.

Top-level keys include:

* ``description``: human-readable description of what is in the file.
* ``bson_type``: hex string of the first byte of a BSON element (e.g. "0x01"
  for type "double"); this will be the synthetic value "0x00" for ``top.json``.
* ``test_key``: name of a field in a ``valid`` test case ``extjson`` document
  that should be checked against the case's ``string`` field.
* ``valid`` (optional): an array of validity test cases (see below).
* ``decodeErrors`` (optional): an array of decode error cases (see below).
* ``parseErrors`` (optional): an array of type-specific parse error cases (see
  below).
* ``deprecated`` (optional): this field will be present (and true) if the BSON
  type has been deprecated (i.e. Symbol, Undefined and DBPointer).

Validity test case keys include:

* ``description``: human-readable test case label.
* ``bson``: an (uppercase) big-endian hex representation of a BSON byte
  string.  Be sure to adjust the letter case as appropriate in any roundtrip
  tests.
* ``extjson``: a document representing the decoded extended JSON document
  equivalent to the subject.
* ``canonical_bson`` (optional): like ``bson``, but is the hex string
  representation of the expected BSON encoder output iff ``bson`` would not be
  generated by a correct encoder (e.g. bad array keys).
* ``canonical_extjson`` (optional): like ``extjson``, but is the expected
  extended JSON encoder output iff ``extjson`` would not be generated by a
  correct encoder (e.g. the "datetime" type, which has more than one extended
  JSON representation).
* ``lossy`` (optional): boolean; present (and true) iff ``canonical_bson``
  (or ``bson`` if there is no ``canonical_bson``) can't be represented exactly
  with extended JSON (e.g. NaN with a payload).

Decode error cases provide an invalid BSON document or field that should
result in an error.  For each case, keys include:

* ``description``: human-readable test case label.
* ``bson``: an (uppercase) big-endian hex representation of an invalid BSON
  string that should fail to decode correctly.

Parse error cases are type-specific and represent some input that cannot be
encoded to the ``bson_type`` under test.  For each case, keys include:

* ``description``: human-readable test case label.
* ``string``: a text or numeric representation of an input that can't be
  parsed to a valid value of the given type.

Extended JSON encoding, escaping and ordering
---------------------------------------------

Because the ``extjson`` and ``canonical_extjson`` fields are embedded in a
JSON document, all their JSON metacharacters are escaped.  Control characters
and non-ASCII codepoints are represented with ``\uXXXX``.  Note that this
means that the corpus JSON will appear to have double-escaped characters
``\\uXXXX``.  This is by design: it keeps the ``extjson`` field printable
ASCII without embedded null characters, for maximum portability across the
JSON and extended JSON decoders of different languages.

The JSON format is *unordered* and whitespace (outside of strings) is not
significant.  Implementations using these tests are responsible for
normalizing JSON however necessary for effective comparison.

Language-specific differences
-----------------------------

Some programming languages may not be able to represent or transmit all types
accurately.  In such cases, implementations SHOULD ignore (or modify) any
tests which are not supported on that platform.

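
As an illustration of the escaping and normalization points made under
"Extended JSON encoding, escaping and ordering", the following Python sketch
decodes a double-escaped ``extjson`` field and compares two JSON strings after
normalization.  The ``normalize_json`` helper and the sample fragment are
hypothetical and are not taken from the corpus files; this is only one
possible normalization strategy, using the standard ``json`` module.

```python
import json


def normalize_json(text: str) -> str:
    """Re-serialize JSON with sorted keys and no insignificant whitespace."""
    return json.dumps(json.loads(text), sort_keys=True, separators=(",", ":"))


# Hypothetical corpus fragment (not copied from a corpus file); the raw string
# keeps the backslashes exactly as they would appear on disk.
corpus_fragment = r'{"extjson": "{\"a\" : \"\\u00e9\", \"b\" : 1}"}'

case = json.loads(corpus_fragment)
# After the first decode, the field is still ASCII with a \uXXXX escape.
assert case["extjson"] == r'{"a" : "\u00e9", "b" : 1}'

# Key order, whitespace and escaping differ, but the normalized forms match.
generated = '{"b":1,"a":"\u00e9"}'
assert normalize_json(case["extjson"]) == normalize_json(generated)
```

Sorting keys discards key order, which matches the statement that the JSON
format is unordered; an implementation that needs to check field order would
have to compare differently.
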
Testing validity
----------------

To test validity of a case in the ``valid`` array, we consider up to four
possible "input" representations: BSON, "canonical" BSON, extended JSON, and
"canonical" extended JSON.  (Not all will exist for a given case.)

For any input, we wish to see if it can be correctly decoded, then re-encoded
to "canonical" BSON and extended JSON representations.  This means there are
up to eight assertions (four input types; two output types).  In some cases,
there may be fewer than four inputs, or some conversions may not be valid,
resulting in fewer than eight assertions.

The following pseudo-code describes which assertions drivers SHOULD test for a
given case::

    B = decode_hex( case["bson"] )
    E = case["extjson"]

    if "canonical_bson" in case:
        cB = decode_hex( case["canonical_bson"] )
    else:
        cB = B

    if "canonical_extjson" in case:
        cE = case["canonical_extjson"]
    else:
        cE = E

    assert encode_bson(decode_bson(B)) == cB                    # B->cB

    if B != cB:
        assert encode_bson(decode_bson(cB)) == cB               # cB->cB

    if "extjson" in case:
        assert encode_extjson(decode_bson(B)) == cE             # B->cE
        assert encode_extjson(decode_extjson(E)) == cE          # E->cE

        if B != cB:
            assert encode_extjson(decode_bson(cB)) == cE        # cB->cE

        if E != cE:
            assert encode_extjson(decode_extjson(cE)) == cE     # cE->cE

        if "lossy" not in case:
            assert encode_bson(decode_extjson(E)) == cB         # E->cB

            if E != cE:
                assert encode_bson(decode_extjson(cE)) == cB    # cE->cB

Implementations MAY test assertions in an implementation-specific manner.

Testing decode errors
---------------------

The ``decodeErrors`` cases represent BSON documents that are sufficiently
incorrect that they can't be parsed even with liberal interpretation of the
BSON schema (e.g. reading arrays with invalid keys is possible, even though
technically invalid, so they are *not* ``decodeErrors``).

Drivers SHOULD test that each case results in a decoding error.
Implementations MAY test assertions in an implementation-specific manner.

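
A decode error check along these lines might be sketched in Python as follows.
The function names, the sample cases, and the ``toy_decode_bson`` stand-in are
all hypothetical: the stand-in only checks document framing (declared length
and trailing null byte), and a real driver would pass its own BSON decoder
instead.

```python
import struct
from typing import Callable, Dict, List


def toy_decode_bson(data: bytes) -> None:
    """Stand-in decoder: checks only the document framing, not the elements."""
    if len(data) < 5:
        raise ValueError("document shorter than the 5-byte minimum")
    (declared_length,) = struct.unpack_from("<i", data, 0)
    if declared_length != len(data):
        raise ValueError("declared length does not match byte count")
    if data[-1] != 0:
        raise ValueError("missing trailing null byte")


def undetected_decode_errors(
    corpus: Dict, decode_bson: Callable[[bytes], object]
) -> List[str]:
    """Return descriptions of decodeErrors cases that decoded without error."""
    undetected = []
    for case in corpus.get("decodeErrors", []):
        raw = bytes.fromhex(case["bson"])  # accepts upper- and lowercase hex
        try:
            decode_bson(raw)
        except Exception:
            continue  # an error is the expected outcome
        undetected.append(case["description"])
    return undetected


# Hypothetical sample data, not copied from the corpus files.
sample_corpus = {
    "description": "sample",
    "decodeErrors": [
        {"description": "declared length too long", "bson": "0600000000"},
        {"description": "missing trailing null", "bson": "0500000001"},
    ],
}

assert undetected_decode_errors(sample_corpus, toy_decode_bson) == []
```

Returning the list of undetected cases, rather than asserting inside the loop,
lets a test harness report every failing case at once.
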
Testing parsing errors
----------------------

The interpretation of ``parseErrors`` is type-specific.  For example, helpers
for creating Decimal128 values may parse strings to convert them to binary
Decimal128 values.  The ``parseErrors`` cases are strings that will *not*
convert correctly.

The documentation for a type (if any) will specify how to use these cases for
testing.

Drivers SHOULD test that each case results in a parse error.  Implementations
MAY test assertions in an implementation-specific manner.

Deprecated types
----------------

The corpus files for deprecated types are provided for informational purposes.
Implementations MAY ignore or modify them to match legacy treatment of
deprecated types.

Implementation Notes
====================

A tool for visualizing BSON
---------------------------

The test directory includes a Perl script ``bsonview``, which will decompose
and highlight elements of a BSON document.  It may be used like this::

    echo "0900000010610005000000" | perl bsonview -x

Notes for certain types
-----------------------

Array
~~~~~

Arrays can have non-canonical BSON if the array indexes are not set as "0",
"1", etc.

Boolean
~~~~~~~

The only valid values are 0 and 1.  Other non-zero numbers MUST be interpreted
as errors rather than "true" values.

Binary
~~~~~~

The Base64 encoded text in the extended JSON representation MUST be padded.

Code
~~~~

There are multiple ways to encode Unicode characters as a JSON document.
Individual implementers may need to normalize provided and generated extended
JSON before comparison.

DateTime
~~~~~~~~

The "canonical" extended JSON format is $numberLong, as this allows *exact*
representation of the underlying BSON binary data without requiring parsing or
rendering and without using system libraries for conversion.  The ISO-8601 UTC
("Zulu") format with millisecond precision is an allowed alternate input.
This differs from mongoexport behavior (which itself has changed over the
years) by design, to ensure the more robust representation.  Implementations
MAY output ISO-8601 by default if necessary for legacy compatibility reasons;
such implementations should swap "extjson" and "canonical_extjson" for
validity cases.

Decimal
~~~~~~~

NaN with payload can't be represented in extended JSON, so such conversions
are lossy.

Double
~~~~~~

There is not yet a way to represent Inf, -Inf or NaN in extended JSON.  Even
if a $numberDouble is added, it is unlikely to support special values with
payloads, so such doubles would be lossy when converted to extended JSON.

String
~~~~~~

There are multiple ways to encode Unicode characters as a JSON document.
Individual implementers may need to normalize provided and generated extended
JSON before comparison.

DBPointer
~~~~~~~~~

This type is deprecated and there is no DBPointer representation in extended
JSON.  mongoexport converts these to DBRef documents, but such conversion is
outside the scope of this spec.

Symbol
~~~~~~

This type is deprecated and there is no Symbol representation in extended
JSON.  mongoexport converts these to strings, but such conversion is outside
the scope of this spec.

Undefined
~~~~~~~~~

This type is deprecated, but there is a "$undefined" representation in
extended JSON.

Reference Implementation
========================

The Java, C# and Perl drivers.

Design Rationale
================

Use of extjson
--------------

Testing conversion requires an "input" and an "output".  With a BSON string as
both input and output, we can only test that it roundtrips correctly -- we
can't test that the decoded value visible to the language is correct.

For example, a pathological encoder/decoder could invert Boolean true and
false during decoding and encoding.  The BSON would roundtrip but the program
would see the wrong values.

Therefore, we need a separate, semantic description of the contents of a BSON
string in a machine-readable format.
Fortunately, we already have extjson as a means of doing so.  The extended
JSON strings contained within the tests adhere to the Extended JSON
Specification.

Repetition across cases
-----------------------

Some validity cases may result in duplicate assertions across cases,
particularly if the ``bson`` field is different in different cases, but the
``canonical_bson`` field is the same.  This is by design so that each case
stands alone and can be confirmed to be internally consistent via the
assertions.  This makes for easier and safer test case development.

Changes
=======

Version 1.3 - January 23, 2017

* Added ``multi-type.json`` to test encoding and decoding all BSON types
  within the same document.
* Amended all extended JSON strings to adhere to the Extended JSON
  Specification.
* Modified the "Use of extjson" section of this specification to note that
  canonical extended JSON is now used.

Version 1.2 - November 14, 2016

* Removed "invalid flags" BSON Regexp case.

Version 1.1 - October 25, 2016

* Added a "non-alphabetized flags" case to the BSON Regexp corpus file;
  decoders must be able to read non-alphabetized flags, but encoders must emit
  alphabetized flags.
* Added an "invalid flags" case to the BSON Regexp corpus file.