= Rationale: Or why am I bothering to rewrite nanomsg?
Garrett D'Amore <garrett@damore.org>
v0.3, April 10, 2018


NOTE: You might want to review
      http://nanomsg.org/documentation-zeromq.html[Martin Sustrik's rationale]
      for nanomsg vs. ZeroMQ.


== Background

I became involved in the
http://www.nanomsg.org[nanomsg] community back in 2014, when
I wrote https://github.com/go-mangos/mangos[mangos] as a pure
http://www.golang.org[Go] implementation of the wire protocols behind
_nanomsg_.  I did that work because I was dissatisfied with the
http://zeromq.org[_ZeroMQ_] licensing model
and the {cpp} baggage that came with it. I also needed something that would
work with _Go_ on http://www.illumos.org[illumos], which at the time
lacked support for `cgo` (so I could not just use an FFI binding.)


At the time, it was the only alternate implementation those protocols.
Writing _mangos_ gave me a lot of detail about the internals of _nanomsg_ and
the SP protocols.

It would not be wrong to say that one of the goals of _mangos_ was to teach
me about _Go_.  It was my first non-trivial _Go_ project.

While working with _mangos_, I wound up implementing a number of additional
features, such as a TLS transport, the ability to bind to wild card ports,
and the ability to determine more information about the sender of a message.
This was incredibly useful in a number of projects.

I initially looked at _nanomsg_ itself, as I wanted to add a TLS transport
to it, and I needed to make some bug fixes (for protocol bugs for example),
and so forth.

== Lessons Learned

Perhaps it might be better to state that there were a number of opportunities
to learn from the lessons of _nanomsg_, as well as lessons we learned while
building _nng_ itself.

=== State Machine Madness

What I ran into in _nanomsg_, when attempting to improve it, was a
challenging mess of state machines. _nanomsg_ has dozens of state machines,
many of which feed into others, such that tracking flow through the state
machines is incredibly painful.

Worse, these state machines are designed to be run from a single worker
thread.  This means that a given socket is entirely single theaded; you
could in theory have dozens, hundreds, or even thousands of connections
open, but they would be serviced only by a single thread.  (Admittedly
non-blocking I/O is used to let the OS kernel calls run asynchronously
perhaps on multiple cores, but nanomsg itself runs all socket code on
a single worker thread.)

There is another problem too -- the `inproc` code that moves messages
between one socket and another was incredibly racy.  This is because the
two sockets have different locks, and so dealing with the different
contexts was tricky (and consequently buggy).  (I've since, I think, fixed
the worst of the bugs here, but only after many hours of pulling out hair.)

The state machines also make fairly linear flow really difficult to follow.
For example, there is a state machine to read the header information.  This
may come a byte a time, and the state machine has to add the bytes, check
for completion, and possibly change state, even if it is just reading a
single 32-bit word.  This is a lot more complex than most programmers are
used to, such as `read(fd, &val, 4)`.

Now to be fair, Martin Sustrik had the best intentions when he created the
state machine model around which _nanomsg_ is built.  I do think that from
experience this is one of the most dense and unapproachable parts of _nanomsg_,
in spite of the fact that Martin's goal was precisely the opposite.  I
consider this a "failed experiment" -- but hey failed experiments are the
basis of all great science.

=== Thread Challenges

While _nanomsg_ is mostly internally single threaded, I decided to try to
emulate the simple architecture of _mangos_ using system threads.  (_mangos_
benefits greatly from _Go_'s excellent coroutine facility.)  Having been well
and truly spoiled by _illumos_ threading (and especially _illumos_ kernel
threads), I thought this would be a reasonable architecture.

Sadly, this initial effort, while it worked, scaled incredibly poorly --
even so-called "modern" operating systems like _macOS_ 10.12 and _Windows_ 8.1
simply melted or failed entirely when creating any non-trivial number of
threads.  (To me, creating 100 threads should be a no-brainer, especially if
one limits the stack size appropriately.  I'm used to be able to create
thousands of threads without concern.  As I said, I've been spoiled.
If your system falls over at a mere 200 threads I consider it a toy
implementation of threading. Unfortunately most of the mainstream operating
systems are therefore toy implementations.)

Chalk up another failed experiment.

I did find another approach which is discussed further.

=== File Descriptor Driven

Most of the underlying I/O in _nanomsg_ is built around file descriptors,
and it's internal usock structure, which is also state machine driven.
This means that implementing new transports which might need something
other than a file descriptor, is really non-trivial.  This stymied my
first attempt to add http://www.openssl.org[OpenSSL] support to get TLS
added -- _OpenSSL_ has it's own `struct BIO` for this stuff, and I could
not see an easy way to convert _nanomsg_'s `usock` stuff to accomodate the
`struct BIO`.

In retrospect, _OpenSSL_ wasn't the ideal choice for an SSL/TLS library,
and we have since chosen another (https://tls.mbed.org[mbed TLS]).
Still, we needed an abstraction model that was better than just file
descriptors for I/O.

=== Poll

In order to support use in event driven programming, asynchronous
situations, etc. _nanomsg_ offers non-blocking I/O.  In order to make
this work for end-users, a notification mechanism is required, and
nanomsg, in the spirit of following POSIX, offers a notification method
based on `poll(2)` or `select(2)`.

In order for this to work, it offers up a selectable file descriptor
for send and another one for receive.  When events occur, these are
written to, and the user application "clears" these by reading from
them.  (This is done on behalf of the application by _nanomsg_'s API calls.)

This means that in addition to the context switch code, there are not
fewer than 2 extra system calls executed per message sent or received, and
on a mostly idle system as many as 3.  This means that to send a message
from one process to another you may have to execute up to 6 extra system
calls, beyond the 2 required to actually send and receive the message.

NOTE: Its even more hideous to support this on Windows, where there is no
      `pipe(2)` system call, so we have to cobble up a loopback TCP connection
      just for this event notification, in addition to the system call
      explosion.

There are cases where this file descriptor logic is easier for existing
applications to integrate into event loops (e.g. they already have a thread
blocked in `poll()`.)

But for many cases this is not necessary.  A simple callback mechanism
would be far better, with the FDs available only as an option for code
that needs them.  This is the approach that we have taken with _nng_.

As another consequence of our approach, we do not require file descriptors
for sockets at all, so it is possible to create applications containing
_many_ thousands of `inproc` sockets with no files open at all.  (Obviously
if you're going to perform real I/O to other processes or other systems,
you're going to need to have the underlying transport file descriptors
open, but then the only real limit should be the number of files that you
can open on your system.  And the number of active connections you can maintain
should ideally approach that system limit closely.)

=== POSIX APIs

Another of Martin's goals, which seems worthwhile at first, was the
attempt to provide a familiar POSIX API (based upon the BSD socket API).
As a C programmer coming from UNIX systems, this really attracted me.

The problem is that the POSIX APIs are actually really horrible.  In
particular the semantics around `cmsg` are about as arcane and painful as
one can imagine.  Largely, this has meant that extensions to the `cmsg`
API simply have not occurred in _nanomsg_.

The `cmsg` API specified by POSIX is as bad as it is because POSIX had
requirements not to break APIs that already existed, and they needed to
shim something that would work with existing implementations, including
getting across a system call boundary. _nanomsg_ has never had such
constraints.

Oh, and there was that whole "design by committee" aspect.

Attempting to retain low numbered "socket descriptors" had its own
problems -- a huge source of use-after-close bugs, which made the
use of `nn_close()` incredibly dangerous for multithreaded sockets.
(If one thread closes and opens a new socket, other threads still using
the old socket might wind up accessing the "new" socket without realizing
it.)

The other thing is that BSD socket APIs are super familiar to UNIX C
programmers -- but experience with _nanomsg_ has taught us already that these
are actually in the minority of _nanomsg_'s users.  Most of our users are
coming to us from {cpp} (object oriented), _Java_, and _Python_ backgrounds.
For them the BSD sockets API is frankly somewhat bizarre and alien.

With _nng_, we realized that constraining ourselves to the mistakes of the
POSIX API was hurting rather than helping. So _nng_ provides a much friendlier
interface for getting properties associated with messages.

In _nng_ we also generally try hard to avoid reusing
an identifier until no other option exists.  This generally means most
applications won't see socket reuse until billions of other sockets
have been opened.  There is little chance for accidental reuse.


== Compatibility

Of course, there are a number of existing _nanomsg_ consumers "in the wild"
already.  It is important to continue to support them.  So I decided from
the get go to implement a "compatibility" layer, that provides the same
API, and as much as possible the same ABI, as legacy _nanomsg_.  However,
new features and capabilities would not necessarily be exposed to the
the legacy API.

Today _nng_ offers this.  You can relink an existing _nanomsg_ binary against
_libnng_ instead of _libnn_, and it usually Just Works(TM).  Source
compatibility is almost as easy, although the application code needs to be
modified to use different header files.

NOTE: I am considering changing the include file in the future so that
it matches exactly the _nanomsg_ include path, so that only a compiler
flag change would be needed.

== Asynchronous IO

As a consequence of our experience with threads being so unscalable,
we decided to create a new underlying abstraction modeled largely on
Windows IO completion ports.  (As bad as so many of the Windows APIs
are, the IO completion port stuff is actually pretty nice.)  Under the
hood in _nng_ all I/O is asynchronous, and we have `nni_aio` objects
for each pending I/O.  These have an associated completion routine.

The completion routines are _usually_ run on a separate worker thread
(we have many such workers; in theory the number should be tuned to the
available number of CPU cores to ensure that we never wait while a CPU
core is available for work), but they can be run "synchronously" if
the I/O provider knows it is safe to do so (for example the completion
is occuring in a context where no locks are held.)

The `nni_aio` structures are accessible to user applications as well, which can
lead to much more efficient and easier to write asynchronous applications,
and can aid integration into event-driven systems and runtimes, without
requiring extra system calls required by the legacy _nanomsg_ approach.

There is still performance tuning work to do, especially optimization for
specific pollers like `epoll()` and `kqueue()` to address the C10K problem,
but that work is already in progress.

== Portability & Embeddability

A significant goal of _nng_ is to be portable to many kinds of different
kinds of systems, and embedded in systems that do not support POSIX or Win32
APIs.  To that end we have a clear platform portability layer.  We do require
that platforms supply entry points for certain networking, synchronization,
threading, and timekeeping functions, but these are fairly straight-forward
to implement on any reasonable 32-bit or 64-bit system, including most
embedded operating systems.

Additionally, this portability layer may be used to build other kinds of
experiments -- for example it should be relatively straight-forward to provide
a "platform" based on one of the various coroutine libraries such as Martin's
http://libdill.org[libdill] or https://swtch.com/libtask/[libtask].

TIP: If you want to write a coroutine-based platform, let me know!

== New Transports

The other, most critical, motivation behind _nng_ was to enable an easier
creation of new transports.  In particular, one client (
http://www.capitar.com[Capitar IT Group BV])
contracted the creation of a http://www.zerotier.com[ZeroTier] transport for
_nanomsg_.

After beating my head against the state machines some more, I finally asked
myself if it would not be easier just to rewrite _nanomsg_ using the model
I had created for _mangos_.

In retrospect, I'm not sure that the answer was a clear and definite yes
in favor of _nng_, but for the other things I want to do, it has enabled a
lot of new work.  The ZeroTier transport was created with a relatively
modest amount of effort, in spite of being based upon a connectionless
transport.  I do not believe I could have done this easily in the existing
_nanomsg_.

I've since added a rich TLS transport, and have implemented a WebSocket
transport that is far more capable than that in _nanomsg_, as it can
support TLS and sharing the TCP port across multiple _nng_ sockets (using
the path to discriminate) or even other HTTP services.

There are already plans afoot for other kinds of transports using QUIC
or KCP or SSH, as well as a pure UDP transport.  The new _nng_ transport
layer makes implementation of these all fairly straight-forward.

== HTTP and Other services

As part of implementing a real WebSocket transport, it was necessary to
implement at least some HTTP capabilities.  Rather than just settle for a toy
implementation, _nng_ has a very capable HTTP server and client framework. 
The server can be used to build real web services, so it becomes possible
for example to serve static content, REST API, and _nng_ based services
all from the same TCP port using the same program.

We've also made the WebSocket services fairly generic, which may support
a plethora of other kinds of transports and services.

There is also a portability layer -- so some common services (threading,
timing, etc.) are provided in the _nng_ library to help make writing
portable _nng_ applications easier.

It will not surprise me if developers start finding uses for _nng_ that
have nothing to do with Scalability Protocols.

== Separate Contexts

As part of working on a demo suite of applications, I realized that the
requirement to use raw mode sockets for concurrent applications was rather
onerous, forcing application developers to re-implement much of the
same logic that is already in _nng_.

Thus was the born the idea of separating the context for protocols from
the socket, allowing multiple contexts (each of which managing it's own
REQ/REP state machinery) to be allocated and used on a single socket.

This was a large change indeed, but we believe application developers
are going to find it *much* easier to write scalable applications,
and hopefully the uses of raw mode and applications needing to inspect
or generate their own application headers will vanish.

Note that these contexts are entirely optional -- an application can
still use the implicit context associated with the socket just like
always, if it has no need for extra concurrency.

One side benefit of this work was that we identified several places
to make _nng_ perform more efficiently, reducing the number of context
switches and extra raw vs. cooked logic.

== Towards _nanomsg_ 2.0

It is my intention that _nng_ ultimately replace _nanomsg_.  I do think of it
as "nanomsg 2.0".  In fact "nng" stands for "nanomsg next generation" in
my mind.  Some day soon I'm hoping that the various website
references to nanomsg my simply be updated to point at _nng_.  It is not
clear to me whether at that time I will simply rename the existing
code to _nanomsg_, nanomsg2, or leave it as _nng_.