# vu128: Efficient variable-length integers

`vu128` is a variable-length integer encoding, with smaller values being
encoded using fewer bytes. Integer sizes up to 128 bits are supported.
The compression ratio of `vu128` equals or exceeds that of the widely used
[VLQ] and [LEB128] encodings, and `vu128` is faster on modern pipelined
architectures.

[VLQ]: https://en.wikipedia.org/wiki/Variable-length_quantity
[LEB128]: https://en.wikipedia.org/wiki/LEB128

# Encoding details

Values in the range `[0, 2^7)` are encoded as a single byte with
the same bits as the original value.

Values in the range `[2^7, 2^28)` are encoded as a unary length prefix,
followed by `(length*7)` bits, in little-endian order. This is conceptually
similar to LEB128, but the continuation bits are placed in the upper half
of the initial byte. This arrangement is also known as a "prefix varint".

```text
MSB ------------------ LSB

      10101011110011011110  Input value (0xABCDE)
   0101010 1111001 1011110  Zero-padded to a multiple of 7 bits
01010101 11100110 ___11110  Grouped into octets, with 3 continuation bits
01010101 11100110 11011110  Continuation bits `110` added
    0x55     0xE6     0xDE  In hexadecimal

        [0xDE, 0xE6, 0x55]  Encoded output (order is little-endian)
```

Values in the range `[2^28, 2^128)` are encoded as a binary length prefix,
followed by payload bytes, in little-endian order. To differentiate this
format from the format of smaller values, the top 4 bits of the first byte
are set. The length prefix value is the number of payload bytes minus one;
equivalently, it is the total length of the encoded value minus two.

```text
MSB ------------------------------------ LSB

      10010001101000101011001111000           Input value (0x12345678)
00010010 00110100 01010110 01111000           Zero-padded to a multiple of 8 bits
00010010 00110100 01010110 01111000 11110011  Prefix byte is `0xF0 | (4 - 1)`
    0x12     0x34     0x56     0x78     0xF3  In hexadecimal

              [0xF3, 0x78, 0x56, 0x34, 0x12]  Encoded output (order is little-endian)
```

# Handling of over-long encodings

The `vu128` format permits over-long encodings, which encode a value using
a byte sequence that is unnecessarily long:

* Zero-padding beyond that required to reach a multiple of 7 or 8 bits.
* Using a length prefix byte for a value in the range `[0, 2^7)`.
* Using a binary length prefix byte for a value in the range `[0, 2^28)`.

The `encode_*` functions in this module will not generate such over-long
encodings, but the `decode_*` functions will accept them. This is intended
to allow `vu128` values to be placed in a buffer before the value to be
written is known. Applications that require a single canonical encoding for
any given value should perform appropriate checking in their own code.

# Signed integers and floating-point values

Signed integers and IEEE-754 floating-point values may be encoded with
`vu128` by mapping them to unsigned integers. It is recommended that the
mapping functions be chosen so as to minimize the number of zeroes in the
higher-order bits, which enables better compression.

This library includes helper functions that use Protocol Buffers'
["ZigZag" encoding] for signed integers and reverse-endian layout for
floating-point.

["ZigZag" encoding]: https://protobuf.dev/programming-guides/encoding/#signed-ints
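
The byte layout described under "Encoding details" can be illustrated with a minimal sketch. The functions `sketch_encode`, `sketch_decode`, and `sketch_zigzag` are hypothetical names chosen for illustration; they are not this library's `encode_*` / `decode_*` functions, and their signatures are assumptions. The sketch is derived only from the two worked examples in this document, and it assumes a growable `Vec<u8>` output rather than a fixed-size buffer.

```rust
/// Hypothetical helper (not the library's API): append the encoding of
/// `value` to `out`, following the layout described in "Encoding details".
fn sketch_encode(value: u128, out: &mut Vec<u8>) {
    let bits = 128 - value.leading_zeros(); // number of significant bits
    if value < (1 << 7) {
        // Single byte, same bits as the value.
        out.push(value as u8);
    } else if value < (1 << 28) {
        // Prefix varint: 7 payload bits per encoded byte, 2 to 4 bytes total.
        let len = (bits + 6) / 7;
        // Unary prefix: (len - 1) one-bits, then a zero, at the top of byte 0.
        let prefix = !(0xFFu8 >> (len - 1));
        out.push(prefix | ((value as u8) & (0xFFu8 >> len)));
        // The remaining bits follow in little-endian order.
        let rest = (value >> (8 - len)) as u32;
        out.extend_from_slice(&rest.to_le_bytes()[..(len - 1) as usize]);
    } else {
        // Binary length prefix: top 4 bits set, low 4 bits = payload bytes - 1.
        let n = ((bits + 7) / 8) as usize; // 4 to 16 payload bytes
        out.push(0xF0 | (n as u8 - 1));
        out.extend_from_slice(&value.to_le_bytes()[..n]);
    }
}

/// Hypothetical helper: decode one value from the front of `buf`, returning
/// the value and the number of bytes consumed, or `None` if `buf` is too
/// short. Over-long encodings are accepted, as the format permits.
fn sketch_decode(buf: &[u8]) -> Option<(u128, usize)> {
    let first = *buf.first()?;
    let ones = first.leading_ones() as usize;
    if ones < 4 {
        // Unary prefix: total length is (number of leading one-bits) + 1.
        let len = ones + 1;
        let tail = buf.get(1..len)?;
        let mut value = u128::from(first & (0xFFu8 >> len));
        for (i, &b) in tail.iter().enumerate() {
            value |= u128::from(b) << (8 - len + 8 * i);
        }
        Some((value, len))
    } else {
        // Binary prefix: low 4 bits hold (payload bytes - 1).
        let n = usize::from(first & 0x0F) + 1;
        let payload = buf.get(1..1 + n)?;
        let mut le = [0u8; 16];
        le[..n].copy_from_slice(payload);
        Some((u128::from_le_bytes(le), n + 1))
    }
}

/// ZigZag mapping as documented for Protocol Buffers: small magnitudes of
/// either sign map to small unsigned values.
fn sketch_zigzag(n: i64) -> u64 {
    ((n << 1) ^ (n >> 63)) as u64
}
```

Under these assumptions, `sketch_encode(0xABCDE, &mut out)` appends `[0xDE, 0xE6, 0x55]` and `sketch_encode(0x12345678, &mut out)` appends `[0xF3, 0x78, 0x56, 0x34, 0x12]`, matching the two diagrams. The decoder accepts an over-long input such as `[0x80, 0x00]` and yields the value `0`; an application that needs canonical encodings could re-encode the decoded value and compare lengths.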