# Protozero Tutorial

## Getting to know Protocol Buffers

Protozero is a very low-level library. You really have to know some of the internals of Protocol Buffers to work with it!

So before reading any further in this document, read the following from the Protocol Buffers documentation:

* [Developer Guide - Overview](https://developers.google.com/protocol-buffers/docs/overview)
* [Language Guide](https://developers.google.com/protocol-buffers/docs/proto)
* [Encoding](https://developers.google.com/protocol-buffers/docs/encoding)

Make sure you understand the basic types of values supported by Protocol Buffers. Refer to this [handy table](https://developers.google.com/protocol-buffers/docs/proto#scalar) and [the cheat sheet](cheatsheet.md) if you are getting lost.

## Prerequisites

You need a C++11-capable compiler for Protozero to work. Copy the files in the `include/protozero` directory to a place where your build system can find them. Keep the `protozero` directory and include the files in the form

```cpp
#include <protozero/FILENAME.hpp>
```

## Parsing protobuf-encoded messages

### Using `pbf_reader`

To use the `pbf_reader` class, add this include to your C++ program:

```cpp
#include <protozero/pbf_reader.hpp>
```

The `pbf_reader` class contains asserts that will detect some programming errors. We encourage you to compile with asserts enabled in your debug builds.
### An introductory example

Let's say you have a protocol description in a `.proto` file like this:

```cpp
message Example1 {
    required uint32 x  = 1;
    optional string s  = 2;
    repeated fixed64 r = 17;
}
```

To read messages created according to that description, you will have code that looks somewhat like this:

```cpp
#include <protozero/pbf_reader.hpp>

// get data from somewhere into the input string
std::string input = get_input_data();

// initialize pbf message with this data
protozero::pbf_reader message{input};

// iterate over fields in the message
while (message.next()) {

    // switch depending on the field tag (the field name is not available)
    switch (message.tag()) {
        case 1: {
            // get data for tag 1 (in this case an uint32);
            // the braces give the local variable its own scope
            auto x = message.get_uint32();
            break;
        }
        case 2: {
            // get data for tag 2 (in this case a string)
            std::string s = message.get_string();
            break;
        }
        case 17:
            // ignore data for tag 17
            message.skip();
            break;
        default:
            // ignore data for unknown tags to allow for future extensions
            message.skip();
    }

}
```

You always have to call `next()` and then either one of the accessor functions (like `get_uint32()` or `get_string()`) to get the field value or `skip()` to ignore this field. Then call `next()` again, and so forth. Never call `next()` twice in a row or any of the accessor or skip functions twice in a row.

Because the `pbf_reader` class doesn't know the `.proto` file, it doesn't know which field names or tags there are and it doesn't know the types of the fields. You have to make sure to call the right `get_...()` function for each tag. Some `assert()`s are done to check you are calling the right functions, but not all errors can be detected.

Note that it doesn't matter whether a field is defined as `required`, `optional`, or `repeated`. You always have to be prepared to get zero, one, or more instances of a field and you always have to be prepared to get other fields, too, unless you want your program to break if somebody adds a new field.
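The reason only the tag is available follows from the Protocol Buffers wire format: every field starts with a varint key that packs the field number together with a three-bit wire type, and nothing else. The following standalone sketch splits such a key. It is not Protozero's implementation; `field_key` and `decode_key` are hypothetical names chosen for illustration, and the input is assumed to hold a complete, well-formed key.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper type for illustration; not part of Protozero.
struct field_key {
    std::uint32_t tag;       // field number from the .proto file
    std::uint32_t wire_type; // 0 = varint, 1 = 64-bit, 2 = length-delimited, 5 = 32-bit
};

// Decode the varint-encoded key at the start of `data`.
// Assumption: `data` begins with a complete, well-formed key.
inline field_key decode_key(const std::string& data) {
    std::uint64_t value = 0;
    int shift = 0;
    std::size_t pos = 0;
    while (true) {
        assert(pos < data.size() && shift < 64);
        const auto byte = static_cast<unsigned char>(data[pos++]);
        value |= static_cast<std::uint64_t>(byte & 0x7fU) << shift;
        if ((byte & 0x80U) == 0) {
            break; // high bit clear: this was the last byte of the varint
        }
        shift += 7;
    }
    // the low three bits are the wire type, the rest is the field number
    return {static_cast<std::uint32_t>(value >> 3),
            static_cast<std::uint32_t>(value & 0x7U)};
}
```

For the `Example1` message, the key of field `r` (tag 17, type `fixed64`) occupies two bytes, `0x89 0x01`, while the keys of `x` and `s` fit into one byte each.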
### If you only need a single field

If, out of a protocol buffer message, you only need the value of a single field, you can use the version of the `next()` function with a parameter:

```cpp
// same .proto file and initialization as above

// get all fields with tag 17, skip all others
while (message.next(17)) {
    auto r = message.get_fixed64();
    std::cout << r << "\n";
}
```

### Handling scalar fields

As you saw in the example, handling scalar field types is reasonably easy. You just check the `.proto` file for the type of a field and call the corresponding function called `get_` + _field type_.

For `string` and `bytes` types the internal handling is exactly the same, but both `get_string()` and `get_bytes()` are provided to make the code self-documenting. Both of these calls allocate and return a `std::string`, which can add some overhead. You can call the `get_view()` function instead, which returns a `data_view` containing a pointer into the data (access with `data()`) and the length of the data (access with `size()`).

### Handling repeated packed fields

Fields that are marked as `[packed=true]` in the `.proto` file are handled somewhat differently. `get_packed_...()` functions returning an iterator range are used to access the data.

So, for example, if you have a protocol description in a `.proto` file like this:

```cpp
message Example2 {
    repeated sint32 i = 1 [packed=true];
}
```

You can get to the data like this:

```cpp
protozero::pbf_reader message{input.data(), input.size()};

// set current field
message.next(1);

// get an iterator range
auto pi = message.get_packed_sint32();

// iterate to get to all values
for (auto it = pi.begin(); it != pi.end(); ++it) {
    std::cout << *it << '\n';
}
```

Or, with a range-based for-loop:

```cpp
for (auto value : pi) {
    std::cout << value << '\n';
}
```

So you are getting a pair of normal forward iterators wrapped in an iterator range object. The iterators can be used with any STL algorithm.

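To make the packed layout more concrete: on the wire, the payload of a packed `sint32` field is a run of zigzag-encoded varints following a length prefix. The sketch below decodes such a payload with the standard library only. It is an illustration and not Protozero's code: the names `zigzag_decode32` and `decode_packed_sint32` are hypothetical, the payload is assumed to be well-formed, and the values are collected into a `std::vector` whereas Protozero hands out an iterator range.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Undo zigzag encoding: 0 -> 0, 1 -> -1, 2 -> 1, 3 -> -2, ...
inline std::int32_t zigzag_decode32(std::uint32_t v) {
    return static_cast<std::int32_t>((v >> 1) ^ (0U - (v & 1U)));
}

// Decode the payload of a packed sint32 field, i.e. the bytes that
// follow the length prefix. Assumption: the payload is well-formed.
inline std::vector<std::int32_t> decode_packed_sint32(const std::string& payload) {
    std::vector<std::int32_t> values;
    std::size_t pos = 0;
    while (pos < payload.size()) {
        std::uint64_t raw = 0;
        int shift = 0;
        while (true) {
            assert(pos < payload.size() && shift < 64);
            const auto byte = static_cast<unsigned char>(payload[pos++]);
            raw |= static_cast<std::uint64_t>(byte & 0x7fU) << shift;
            if ((byte & 0x80U) == 0) {
                break; // last byte of this varint
            }
            shift += 7;
        }
        values.push_back(zigzag_decode32(static_cast<std::uint32_t>(raw)));
    }
    return values;
}
```

Because the result is an ordinary container here, it can be passed to algorithms such as `std::accumulate` in the same way the iterator range from `get_packed_sint32()` can.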
Note that the `get_packed_...()` functions only apply to repeated **packed** fields; normal repeated fields are handled in the usual way for scalar fields.

### Handling embedded messages

Protocol Buffers can embed any message inside another message. To access an embedded message use the `get_message()` function. So for this description:

```cpp
message Point {
    required double x = 1;
    required double y = 2;
}

message Example3 {
    repeated Point point = 10;
}
```

you can parse with this code:

```cpp
protozero::pbf_reader message{input};

while (message.next(10)) {
    protozero::pbf_reader point = message.get_message();
    // initialize in case a field is missing from the data
    double x = 0.0;
    double y = 0.0;
    while (point.next()) {
        switch (point.tag()) {
            case 1:
                x = point.get_double();
                break;
            case 2:
                y = point.get_double();
                break;
            default:
                point.skip();
        }
    }
    std::cout << "x=" << x << " y=" << y << "\n";
}
```

### Handling enums

Enums are stored as varints and can't be distinguished from them. Use the `get_enum()` function to get the value of the enum; you have to translate this into the symbolic name yourself. See the `enum` test case for an example.

### Asserts and exceptions in the Protozero library

Protozero uses `assert()` liberally to help you find bugs in your own code when compiled in debug mode (i.e. with `NDEBUG` not set). If such an assert "fires", this is a very strong indication that there is a bug in your code somewhere.

(Protozero will disable those asserts and "convert" them into exceptions in its own test code. This is done to make sure the asserts actually work as intended. Your test code will not need this!)

Exceptions, on the other hand, are thrown by Protozero if some kind of data corruption was detected while it is trying to parse the data. This could also be an indicator for a bug in the user code, but because it can happen if the data has been tampered with (intentionally or not), it is reported to the user code using exceptions.

Most of the functions on the writer side can throw a `std::bad_alloc` exception if there is no space to grow a buffer. Other than that no exceptions can occur on the writer side.

All exceptions thrown by the reader side derive from `protozero::exception`.

Note that all exceptions can also happen if you are expecting a data field of a certain type in your code but the field actually has a different type. In that case the `pbf_reader` class might interpret the bytes in the buffer in the wrong way and anything can happen.

#### `end_of_buffer_exception`

This will be thrown whenever any of the functions "runs out of input data". It means you either have an incomplete message in your input or some other data corruption has taken place.

#### `unknown_pbf_wire_type_exception`

This will be thrown if an unsupported wire type is encountered. Either your input data is corrupted or it was written with an unsupported version of a Protocol Buffers implementation.

#### `varint_too_long_exception`

This exception indicates an illegal encoding of a varint. It means your input data is corrupted in some way.

#### `invalid_tag_exception`

This exception is thrown when a tag has an invalid value. Tags must be unsigned integers between 1 and 2^29-1. Tags between 19000 and 19999 are not allowed. See https://developers.google.com/protocol-buffers/docs/proto#assigning-tags

#### `invalid_length_exception`

This exception is thrown when a length field of a packed repeated field is invalid. For fixed size types the length must be a multiple of the size of the type.

### The `pbf_reader` class

The `pbf_reader` class behaves like a value type. Objects are reasonably small (two pointers and two `uint32_t`, so 24 bytes on a 64-bit system) and they can be copied and moved around trivially.

`pbf_reader` objects can be constructed from a `std::string` or a `const char*` and a length field (either supplied as separate arguments or as a `std::pair`).

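All of these construction forms boil down to a pointer plus a length. As a rough illustration of what such a non-owning type looks like, here is a small sketch using the hypothetical name `byte_view`. It is not Protozero's class and leaves out all parsing; it only shows that copies are cheap and that every copy refers to the caller's buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical non-owning view: stores a pointer and a length, never a copy
// of the bytes. Not part of Protozero.
class byte_view {

    const char* m_data;
    std::size_t m_size;

public:

    byte_view(const char* data, std::size_t size) noexcept
        : m_data(data), m_size(size) {
        assert(data != nullptr || size == 0);
    }

    explicit byte_view(const std::pair<const char*, std::size_t>& p) noexcept
        : byte_view(p.first, p.second) {
    }

    explicit byte_view(const std::string& s) noexcept
        : byte_view(s.data(), s.size()) {
    }

    const char* data() const noexcept { return m_data; }
    std::size_t size() const noexcept { return m_size; }

};
```

Copying a `byte_view` copies two words and nothing else, so the original buffer has to outlive every copy.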
In all cases objects of the `pbf_reader` class store a pointer into the input data that was given to the constructor. You have to make sure this pointer stays valid for the duration of the object's lifetime.

## Parsing protobuf-encoded messages using `pbf_message`

One problem in the code above is the "magic numbers" used as tags for the different fields that you got from the `.proto` file. Instead of spreading these magic numbers around your code you can define them once in an `enum class` and then use the `pbf_message` template class instead of the `pbf_reader` class.

Here is the first example again, this time using this new technique. So you have the following in a `.proto` file:

```cpp
message Example1 {
    required uint32 x  = 1;
    optional string s  = 2;
    repeated fixed64 r = 17;
}
```

Add the following declaration in one of your header files:

```cpp
enum class Example1 : protozero::pbf_tag_type {
    required_uint32_x  = 1,
    optional_string_s  = 2,
    repeated_fixed64_r = 17
};
```

The message name becomes the name of the `enum class`, which is always built on top of the `protozero::pbf_tag_type` type. Each field in the message becomes one value of the enum. In this case the name is created from the type (including the modifiers like `required` or `optional`) and the name of the field. You can use any name you want, but this convention makes it easier to get everything right later.
To read messages created according to that description, you will have code that looks somewhat like this, this time using `pbf_message` instead of `pbf_reader`:

```cpp
#include <protozero/pbf_message.hpp>

// get data from somewhere into the input string
std::string input = get_input_data();

// initialize pbf message with this data
protozero::pbf_message<Example1> message{input};

// iterate over fields in the message
while (message.next()) {

    // switch depending on the field tag (the field name is not available)
    switch (message.tag()) {
        case Example1::required_uint32_x: {
            // the braces give the local variable its own scope
            auto x = message.get_uint32();
            break;
        }
        case Example1::optional_string_s: {
            std::string s = message.get_string();
            break;
        }
        case Example1::repeated_fixed64_r:
            message.skip();
            break;
        default:
            // ignore data for unknown tags to allow for future extensions
            message.skip();
    }

}
```

Note the correspondence between the enum value (for instance `required_uint32_x`) and the name of the getter function (for instance `get_uint32()`). This makes it easier to get the correct types. The naming also makes it easier to keep different message types apart if you have multiple (or embedded) messages.

See the `test/t/complex` test case for a complete example using this interface.

Using `pbf_message` in favour of `pbf_reader` is recommended for all code. Note that `pbf_message` derives from `pbf_reader`, so you can always fall back to the more generic interface if necessary.

One problem you might run into is the following: The enum class lists all possible values you know about and you'll have lots of `switch` statements checking those values. Some compilers will know that your `switch` covers all possible cases and warn you if you have a `default` case that looks unnecessary to the compiler. But you still want that `default` case to allow for future extension of those messages (and maybe also to detect corrupted data). You can switch off this warning with `-Wno-covered-switch-default`.
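The following standalone sketch shows why the `default` case still matters even though the `switch` lists every enumerator: a tag read from the wire can hold any integer value, including ones the enum does not name. The sketch avoids Protozero itself; `tag_type` is a stand-in for `protozero::pbf_tag_type` (assumed here to be a 32-bit unsigned integer) and `describe` is a hypothetical helper.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Stand-in for protozero::pbf_tag_type so the sketch compiles without
// Protozero; the 32-bit unsigned type is an assumption.
using tag_type = std::uint32_t;

enum class Example1 : tag_type {
    required_uint32_x  = 1,
    optional_string_s  = 2,
    repeated_fixed64_r = 17
};

// Map a tag to a short description. Tags that the enum does not name
// (for instance from a newer version of the message) end up in `default`.
inline std::string describe(Example1 tag) {
    switch (tag) {
        case Example1::required_uint32_x:
            return "x";
        case Example1::optional_string_s:
            return "s";
        case Example1::repeated_fixed64_r:
            return "r";
        default:
            return "unknown";
    }
}
```

Calling `describe(static_cast<Example1>(99))` takes the `default` branch, which is exactly the situation a reader faces when somebody adds a new field to the message.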
## Writing protobuf-encoded messages

### Using `pbf_writer`

To use the `pbf_writer` class, add this include to your C++ program:

```cpp
#include <protozero/pbf_writer.hpp>
```

The `pbf_writer` class contains asserts that will detect some programming errors. We encourage you to compile with asserts enabled in your debug builds.

### An introductory example

Let's say you have a protocol description in a `.proto` file like this:

```cpp
message Example {
    required uint32 x  = 1;
    optional string s  = 2;
    repeated fixed64 r = 17;
}
```

To write messages created according to that description, you will have code that looks somewhat like this:

```cpp
#include <protozero/pbf_writer.hpp>

std::string data;
protozero::pbf_writer pbf_example{data};

pbf_example.add_uint32(1, 27);       // uint32_t x
pbf_example.add_fixed64(17, 1);      // fixed64 r
pbf_example.add_fixed64(17, 2);
pbf_example.add_fixed64(17, 3);
pbf_example.add_string(2, "foobar"); // string s
```

First you need a string which will be used as a buffer to assemble the protobuf-formatted message. The `pbf_writer` object contains a reference to this string buffer and through it you add data to that buffer piece by piece. The buffer doesn't have to be empty; the `pbf_writer` will simply append its data to whatever is there already.

### Handling scalar fields

As you could see in the introductory example, handling any kind of scalar field is easy. The type of field doesn't matter and it doesn't matter whether it is optional, required or repeated. You always call one of the `add_TYPE()` methods on the pbf writer object.

The first parameter of these methods is always the *tag* of the field (the field number) from the `.proto` file. The second parameter is the value you want to set. For the `bytes` and `string` types several versions of the add method are available taking a `const std::string&` or a `const char*` and a length.

For `enum` types you have to use the numeric value as the symbolic names from the `.proto` file are not available.
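To give an idea of what ends up in the buffer for a varint-typed field such as `uint32`, here is a standalone sketch of the Protocol Buffers encoding: first the key (tag and wire type), then the value, both as varints, appended to whatever the string already holds. This is not Protozero's implementation; `append_varint` and `append_uint32_field` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Append `value` as a base-128 varint: seven bits per byte, least
// significant group first, high bit set on all bytes but the last.
inline void append_varint(std::string& buffer, std::uint64_t value) {
    while (value >= 0x80U) {
        buffer.push_back(static_cast<char>((value & 0x7fU) | 0x80U));
        value >>= 7;
    }
    buffer.push_back(static_cast<char>(value));
}

// Append one uint32 field: the key (tag << 3 | wire type 0 for varints)
// followed by the value. Existing buffer contents are left untouched.
inline void append_uint32_field(std::string& buffer, std::uint32_t tag, std::uint32_t value) {
    assert(tag >= 1 && tag <= ((1U << 29) - 1U));
    append_varint(buffer, (static_cast<std::uint64_t>(tag) << 3) | 0U);
    append_varint(buffer, value);
}
```

With an empty buffer, `append_uint32_field(buffer, 1, 27)` produces the two bytes `0x08 0x1b`, which is the encoding of the `x` field with the value used in the introductory example.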
### Handling repeated packed fields

Repeated packed fields can easily be set from a pair of iterators:

```cpp
std::string data;
protozero::pbf_writer pw{data};

std::vector<int> v = { 1, 4, 9, 16, 25, 36 };
pw.add_packed_int32(1, std::begin(v), std::end(v));
```

If you don't have an iterator you can use the alternative form:

```cpp
std::string data;
protozero::pbf_writer pw{data};
{
    protozero::packed_field_int32 field{pw, 1};
    field.add_element(1);
    field.add_element(10);
    field.add_element(100);
}
```

Of course you can add as many elements as you want. If you add no elements at all, this code will still work; Protozero detects this special case and pretends you never even initialized this field.

The nested scope is important in this case, because the destructor of the `field` object will make sure the length stored inside the field is set to the right value. You must close that scope before adding other fields to the `pw` pbf writer.

If you know how many elements you will add to the field and your field contains fixed length elements, you can tell Protozero and it can optimize this case:

```cpp
std::string data;
protozero::pbf_writer pw{data};
{
    protozero::packed_field_fixed32 field{pw, 1, 2}; // exactly two elements
    field.add_element(42);
    field.add_element(13);
}
```

In this case you have to supply exactly as many elements as you promised, otherwise you will get a broken protobuf message.

This works for `packed_field_fixed32`, `packed_field_sfixed32`, `packed_field_fixed64`, `packed_field_sfixed64`, `packed_field_float`, and `packed_field_double`.

You can abandon writing of the packed field if this becomes necessary by calling `rollback()`:

```cpp
std::string data;
protozero::pbf_writer pw{data};
{
    protozero::packed_field_int32 field{pw, 1};
    field.add_element(42);
    // some error occurs, you don't want to have this field at all
    field.rollback();
}
```

The result is the same as if the lines inside the nested brackets had never been called.

Do not try to call `add_element()` after a rollback.

### Handling sub-messages

Nested sub-messages can be handled by first creating the sub-message and then adding it to the parent message:

```cpp
std::string buffer_sub;
protozero::pbf_writer pbf_sub{buffer_sub};

// add fields to sub-message
pbf_sub.add_...(...);
// ...

// sub-message is finished here

std::string buffer_parent;
protozero::pbf_writer pbf_parent{buffer_parent};
pbf_parent.add_message(1, buffer_sub);
```

This is easy to do but it has the drawback of needing a separate `std::string` buffer. If this concerns you (and why would you use Protozero and not the Google protobuf library if it doesn't?) there is another way:

```cpp
std::string data;
protozero::pbf_writer pbf_parent{data};

// optionally add fields to parent here
pbf_parent.add_...(...);

// open a new scope
{
    // create new pbf_writer with parent and the tag (field number)
    // as parameters
    protozero::pbf_writer pbf_sub{pbf_parent, 1};

    // add fields to sub here...
    pbf_sub.add_...(...);

} // closing the scope will close the sub-message

// optionally add more fields to parent here
pbf_parent.add_...(...);
```

This can be nested arbitrarily deep.

Internally the sub-message writer re-uses the buffer from the parent. It reserves enough space in the buffer to later write the length of the sub-message into it. It then adds the contents of the sub-message to the buffer. When the `pbf_sub` writer is destructed the length of the sub-message is calculated and written in the reserved space. If less space was needed for the length field than was available, the rest of the buffer is moved over a few bytes.

You can abandon writing of a sub-message if this becomes necessary by calling `rollback()`:

```cpp
std::string data;
protozero::pbf_writer pbf_parent{data};

// open a new scope
{
    // create new pbf_writer with parent and the tag (field number)
    // as parameters
    protozero::pbf_writer pbf_sub{pbf_parent, 1};

    // add fields to sub here...
    pbf_sub.add_...(...);

    // some problem occurs and you want to abandon the sub-message:
    pbf_sub.rollback();
}

// optionally add more fields to parent here
pbf_parent.add_...(...);
```

The result is the same as if the lines inside the nested brackets had never been called. Do not try to call any of the `add_*` functions on the sub-message after a rollback.

## Writing protobuf-encoded messages using `pbf_builder`

Just like the `pbf_message` template class wraps the `pbf_reader` class, there is a `pbf_builder` template class wrapping the `pbf_writer` class. It is instantiated using the same `enum class` described above and used exactly like the `pbf_writer` class, but with the values of the enum instead of bare integers.

See the `test/t/complex` test case for a complete example using this interface.
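The idea of a typed wrapper around an integer-tagged writer can be sketched without Protozero. In the sketch below every name is hypothetical: `recording_writer` merely records calls and stands in for a writer taking integer tags, and `typed_builder` only accepts values of one `enum class` and forwards them as integers. How `pbf_builder` is actually implemented is not covered here.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

enum class Example1 : std::uint32_t {
    required_uint32_x  = 1,
    optional_string_s  = 2,
    repeated_fixed64_r = 17
};

// Hypothetical stand-in for a writer with integer tags; it only records
// (tag, value) pairs instead of encoding anything.
struct recording_writer {
    std::vector<std::pair<std::uint32_t, std::uint32_t>> calls;

    void add_uint32(std::uint32_t tag, std::uint32_t value) {
        calls.emplace_back(tag, value);
    }
};

// Hypothetical typed wrapper: tags must be values of TEnum, so a bare
// integer or a value from a different message's enum does not compile.
template <typename TEnum>
class typed_builder {

    recording_writer& m_writer;

public:

    explicit typed_builder(recording_writer& writer) : m_writer(writer) {
    }

    void add_uint32(TEnum tag, std::uint32_t value) {
        assert(static_cast<std::uint32_t>(tag) >= 1); // tags start at 1
        m_writer.add_uint32(static_cast<std::uint32_t>(tag), value);
    }

};
```

A call such as `builder.add_uint32(Example1::required_uint32_x, 27)` then reads the same as the bare-integer version while keeping the tag values in one place.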