= Rationale: Or why am I bothering to rewrite nanomsg? Garrett D'Amore v0.3, April 10, 2018 NOTE: You might want to review http://nanomsg.org/documentation-zeromq.html[Martin Sustrik's rationale] for nanomsg vs. ZeroMQ. == Background I became involved in the http://www.nanomsg.org[nanomsg] community back in 2014, when I wrote https://github.com/go-mangos/mangos[mangos] as a pure http://www.golang.org[Go] implementation of the wire protocols behind _nanomsg_. I did that work because I was dissatisfied with the http://zeromq.org[_ZeroMQ_] licensing model and the {cpp} baggage that came with it. I also needed something that would work with _Go_ on http://www.illumos.org[illumos], which at the time lacked support for `cgo` (so I could not just use an FFI binding.) At the time, it was the only alternate implementation those protocols. Writing _mangos_ gave me a lot of detail about the internals of _nanomsg_ and the SP protocols. It would not be wrong to say that one of the goals of _mangos_ was to teach me about _Go_. It was my first non-trivial _Go_ project. While working with _mangos_, I wound up implementing a number of additional features, such as a TLS transport, the ability to bind to wild card ports, and the ability to determine more information about the sender of a message. This was incredibly useful in a number of projects. I initially looked at _nanomsg_ itself, as I wanted to add a TLS transport to it, and I needed to make some bug fixes (for protocol bugs for example), and so forth. == Lessons Learned Perhaps it might be better to state that there were a number of opportunities to learn from the lessons of _nanomsg_, as well as lessons we learned while building _nng_ itself. === State Machine Madness What I ran into in _nanomsg_, when attempting to improve it, was a challenging mess of state machines. _nanomsg_ has dozens of state machines, many of which feed into others, such that tracking flow through the state machines is incredibly painful. Worse, these state machines are designed to be run from a single worker thread. This means that a given socket is entirely single theaded; you could in theory have dozens, hundreds, or even thousands of connections open, but they would be serviced only by a single thread. (Admittedly non-blocking I/O is used to let the OS kernel calls run asynchronously perhaps on multiple cores, but nanomsg itself runs all socket code on a single worker thread.) There is another problem too -- the `inproc` code that moves messages between one socket and another was incredibly racy. This is because the two sockets have different locks, and so dealing with the different contexts was tricky (and consequently buggy). (I've since, I think, fixed the worst of the bugs here, but only after many hours of pulling out hair.) The state machines also make fairly linear flow really difficult to follow. For example, there is a state machine to read the header information. This may come a byte a time, and the state machine has to add the bytes, check for completion, and possibly change state, even if it is just reading a single 32-bit word. This is a lot more complex than most programmers are used to, such as `read(fd, &val, 4)`. Now to be fair, Martin Sustrik had the best intentions when he created the state machine model around which _nanomsg_ is built. I do think that from experience this is one of the most dense and unapproachable parts of _nanomsg_, in spite of the fact that Martin's goal was precisely the opposite. I consider this a "failed experiment" -- but hey failed experiments are the basis of all great science. === Thread Challenges While _nanomsg_ is mostly internally single threaded, I decided to try to emulate the simple architecture of _mangos_ using system threads. (_mangos_ benefits greatly from _Go_'s excellent coroutine facility.) Having been well and truly spoiled by _illumos_ threading (and especially _illumos_ kernel threads), I thought this would be a reasonable architecture. Sadly, this initial effort, while it worked, scaled incredibly poorly -- even so-called "modern" operating systems like _macOS_ 10.12 and _Windows_ 8.1 simply melted or failed entirely when creating any non-trivial number of threads. (To me, creating 100 threads should be a no-brainer, especially if one limits the stack size appropriately. I'm used to be able to create thousands of threads without concern. As I said, I've been spoiled. If your system falls over at a mere 200 threads I consider it a toy implementation of threading. Unfortunately most of the mainstream operating systems are therefore toy implementations.) Chalk up another failed experiment. I did find another approach which is discussed further. === File Descriptor Driven Most of the underlying I/O in _nanomsg_ is built around file descriptors, and it's internal usock structure, which is also state machine driven. This means that implementing new transports which might need something other than a file descriptor, is really non-trivial. This stymied my first attempt to add http://www.openssl.org[OpenSSL] support to get TLS added -- _OpenSSL_ has it's own `struct BIO` for this stuff, and I could not see an easy way to convert _nanomsg_'s `usock` stuff to accomodate the `struct BIO`. In retrospect, _OpenSSL_ wasn't the ideal choice for an SSL/TLS library, and we have since chosen another (https://tls.mbed.org[mbed TLS]). Still, we needed an abstraction model that was better than just file descriptors for I/O. === Poll In order to support use in event driven programming, asynchronous situations, etc. _nanomsg_ offers non-blocking I/O. In order to make this work for end-users, a notification mechanism is required, and nanomsg, in the spirit of following POSIX, offers a notification method based on `poll(2)` or `select(2)`. In order for this to work, it offers up a selectable file descriptor for send and another one for receive. When events occur, these are written to, and the user application "clears" these by reading from them. (This is done on behalf of the application by _nanomsg_'s API calls.) This means that in addition to the context switch code, there are not fewer than 2 extra system calls executed per message sent or received, and on a mostly idle system as many as 3. This means that to send a message from one process to another you may have to execute up to 6 extra system calls, beyond the 2 required to actually send and receive the message. NOTE: Its even more hideous to support this on Windows, where there is no `pipe(2)` system call, so we have to cobble up a loopback TCP connection just for this event notification, in addition to the system call explosion. There are cases where this file descriptor logic is easier for existing applications to integrate into event loops (e.g. they already have a thread blocked in `poll()`.) But for many cases this is not necessary. A simple callback mechanism would be far better, with the FDs available only as an option for code that needs them. This is the approach that we have taken with _nng_. As another consequence of our approach, we do not require file descriptors for sockets at all, so it is possible to create applications containing _many_ thousands of `inproc` sockets with no files open at all. (Obviously if you're going to perform real I/O to other processes or other systems, you're going to need to have the underlying transport file descriptors open, but then the only real limit should be the number of files that you can open on your system. And the number of active connections you can maintain should ideally approach that system limit closely.) === POSIX APIs Another of Martin's goals, which seems worthwhile at first, was the attempt to provide a familiar POSIX API (based upon the BSD socket API). As a C programmer coming from UNIX systems, this really attracted me. The problem is that the POSIX APIs are actually really horrible. In particular the semantics around `cmsg` are about as arcane and painful as one can imagine. Largely, this has meant that extensions to the `cmsg` API simply have not occurred in _nanomsg_. The `cmsg` API specified by POSIX is as bad as it is because POSIX had requirements not to break APIs that already existed, and they needed to shim something that would work with existing implementations, including getting across a system call boundary. _nanomsg_ has never had such constraints. Oh, and there was that whole "design by committee" aspect. Attempting to retain low numbered "socket descriptors" had its own problems -- a huge source of use-after-close bugs, which made the use of `nn_close()` incredibly dangerous for multithreaded sockets. (If one thread closes and opens a new socket, other threads still using the old socket might wind up accessing the "new" socket without realizing it.) The other thing is that BSD socket APIs are super familiar to UNIX C programmers -- but experience with _nanomsg_ has taught us already that these are actually in the minority of _nanomsg_'s users. Most of our users are coming to us from {cpp} (object oriented), _Java_, and _Python_ backgrounds. For them the BSD sockets API is frankly somewhat bizarre and alien. With _nng_, we realized that constraining ourselves to the mistakes of the POSIX API was hurting rather than helping. So _nng_ provides a much friendlier interface for getting properties associated with messages. In _nng_ we also generally try hard to avoid reusing an identifier until no other option exists. This generally means most applications won't see socket reuse until billions of other sockets have been opened. There is little chance for accidental reuse. == Compatibility Of course, there are a number of existing _nanomsg_ consumers "in the wild" already. It is important to continue to support them. So I decided from the get go to implement a "compatibility" layer, that provides the same API, and as much as possible the same ABI, as legacy _nanomsg_. However, new features and capabilities would not necessarily be exposed to the the legacy API. Today _nng_ offers this. You can relink an existing _nanomsg_ binary against _libnng_ instead of _libnn_, and it usually Just Works(TM). Source compatibility is almost as easy, although the application code needs to be modified to use different header files. NOTE: I am considering changing the include file in the future so that it matches exactly the _nanomsg_ include path, so that only a compiler flag change would be needed. == Asynchronous IO As a consequence of our experience with threads being so unscalable, we decided to create a new underlying abstraction modeled largely on Windows IO completion ports. (As bad as so many of the Windows APIs are, the IO completion port stuff is actually pretty nice.) Under the hood in _nng_ all I/O is asynchronous, and we have `nni_aio` objects for each pending I/O. These have an associated completion routine. The completion routines are _usually_ run on a separate worker thread (we have many such workers; in theory the number should be tuned to the available number of CPU cores to ensure that we never wait while a CPU core is available for work), but they can be run "synchronously" if the I/O provider knows it is safe to do so (for example the completion is occuring in a context where no locks are held.) The `nni_aio` structures are accessible to user applications as well, which can lead to much more efficient and easier to write asynchronous applications, and can aid integration into event-driven systems and runtimes, without requiring extra system calls required by the legacy _nanomsg_ approach. There is still performance tuning work to do, especially optimization for specific pollers like `epoll()` and `kqueue()` to address the C10K problem, but that work is already in progress. == Portability & Embeddability A significant goal of _nng_ is to be portable to many kinds of different kinds of systems, and embedded in systems that do not support POSIX or Win32 APIs. To that end we have a clear platform portability layer. We do require that platforms supply entry points for certain networking, synchronization, threading, and timekeeping functions, but these are fairly straight-forward to implement on any reasonable 32-bit or 64-bit system, including most embedded operating systems. Additionally, this portability layer may be used to build other kinds of experiments -- for example it should be relatively straight-forward to provide a "platform" based on one of the various coroutine libraries such as Martin's http://libdill.org[libdill] or https://swtch.com/libtask/[libtask]. TIP: If you want to write a coroutine-based platform, let me know! == New Transports The other, most critical, motivation behind _nng_ was to enable an easier creation of new transports. In particular, one client ( http://www.capitar.com[Capitar IT Group BV]) contracted the creation of a http://www.zerotier.com[ZeroTier] transport for _nanomsg_. After beating my head against the state machines some more, I finally asked myself if it would not be easier just to rewrite _nanomsg_ using the model I had created for _mangos_. In retrospect, I'm not sure that the answer was a clear and definite yes in favor of _nng_, but for the other things I want to do, it has enabled a lot of new work. The ZeroTier transport was created with a relatively modest amount of effort, in spite of being based upon a connectionless transport. I do not believe I could have done this easily in the existing _nanomsg_. I've since added a rich TLS transport, and have implemented a WebSocket transport that is far more capable than that in _nanomsg_, as it can support TLS and sharing the TCP port across multiple _nng_ sockets (using the path to discriminate) or even other HTTP services. There are already plans afoot for other kinds of transports using QUIC or KCP or SSH, as well as a pure UDP transport. The new _nng_ transport layer makes implementation of these all fairly straight-forward. == HTTP and Other services As part of implementing a real WebSocket transport, it was necessary to implement at least some HTTP capabilities. Rather than just settle for a toy implementation, _nng_ has a very capable HTTP server and client framework. The server can be used to build real web services, so it becomes possible for example to serve static content, REST API, and _nng_ based services all from the same TCP port using the same program. We've also made the WebSocket services fairly generic, which may support a plethora of other kinds of transports and services. There is also a portability layer -- so some common services (threading, timing, etc.) are provided in the _nng_ library to help make writing portable _nng_ applications easier. It will not surprise me if developers start finding uses for _nng_ that have nothing to do with Scalability Protocols. == Separate Contexts As part of working on a demo suite of applications, I realized that the requirement to use raw mode sockets for concurrent applications was rather onerous, forcing application developers to re-implement much of the same logic that is already in _nng_. Thus was the born the idea of separating the context for protocols from the socket, allowing multiple contexts (each of which managing it's own REQ/REP state machinery) to be allocated and used on a single socket. This was a large change indeed, but we believe application developers are going to find it *much* easier to write scalable applications, and hopefully the uses of raw mode and applications needing to inspect or generate their own application headers will vanish. Note that these contexts are entirely optional -- an application can still use the implicit context associated with the socket just like always, if it has no need for extra concurrency. One side benefit of this work was that we identified several places to make _nng_ perform more efficiently, reducing the number of context switches and extra raw vs. cooked logic. == Towards _nanomsg_ 2.0 It is my intention that _nng_ ultimately replace _nanomsg_. I do think of it as "nanomsg 2.0". In fact "nng" stands for "nanomsg next generation" in my mind. Some day soon I'm hoping that the various website references to nanomsg my simply be updated to point at _nng_. It is not clear to me whether at that time I will simply rename the existing code to _nanomsg_, nanomsg2, or leave it as _nng_.