================
Server Selection
================

:Spec: 103
:Title: Server Selection
:Author: David Golden
:Lead: Bernie Hackett
:Advisors: \A. Jesse Jiryu Davis, Samantha Ritter, Robert Stam, Jeff Yemin
:Status: Accepted
:Type: Standards
:Last Modified: November 21, 2016
:Version: 1.5

.. contents::

Abstract
========

MongoDB deployments may offer more than one server that can service an
operation.  This specification describes how MongoDB drivers and mongos shall
select a server for either read or write operations.  It includes the definition
of a "read preference" document, configuration options, and algorithms for
selecting a server for different deployment topologies.

Meta
====

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD",
"SHOULD NOT", "RECOMMENDED",  "MAY", and "OPTIONAL" in this document are to be
interpreted as described in `RFC 2119`_.

.. _RFC 2119: https://www.ietf.org/rfc/rfc2119.txt

Motivation for Change
=====================

This specification builds upon the prior "Driver Read Preference"
specification, which had a number of omissions, flaws
or other deficiencies:

#.  Mandating features that implied monotonicity for situations where
    monotonicity is not guaranteed

#.  Mandating features that are not supported by mongos

#.  Neglecting to specify a single, standard way to calculate average latency
    times

#.  Specifying complex command-helper rules

#.  Omitting rules for applying read preferences to a single server or to
    select among multiple mongos servers

#.  Omitting test cases for verification of spec compliance

This revision addresses these problems and improves structure and
specificity.

Additionally, it adds specifications for server selection more broadly:

*   Selection of a server for write operations

*   Server selection retry and timeout

Specification
=============

Scope and general requirements
------------------------------

This specification describes how MongoDB drivers and mongos select a server
for read and write operations, including commands, OP_QUERY, OP_INSERT, OP_UPDATE,
and OP_DELETE.  For read operations, it describes how drivers and mongos
shall interpret a read preference document.

This specification does not apply to OP_GET_MORE or OP_KILL_CURSORS
operations on cursors, which need to go to the same server that received an
OP_QUERY and returned a cursor ID.

Drivers and mongos MUST conform to the semantics of this document, but SHOULD
use language-appropriate data models or variable names.

This specification does not apply to commands issued for server monitoring or
authentication.

Terms
-----

**Available**
    Describes a server that is believed to be reachable over the network and
    able to respond to requests.  A server of type Unknown or PossiblePrimary
    is not available; other types are available.

**Client**
    Software that communicates with a MongoDB deployment.  This includes both
    drivers and mongos.

**Candidate**
    Describes servers in a deployment that enter the selection process,
    determined by the read preference ``mode`` parameter and the servers' type.
    Depending on the ``mode``, candidate servers might only include secondaries
    or might include all servers in the deployment.

**Deployment**
    One or more servers that collectively provide access to a single logical
    set of MongoDB databases.

**Command**
    An OP_QUERY operation targeting the '$cmd' collection namespace.

**Direct connection**
    A driver connection mode that sends all database operations to a single
    server without regard for type.

.. _eligible:

**Eligible**
    Describes candidate servers that also meet the criteria specified by the
    ``tag_sets`` and ``maxStalenessSeconds`` read preference parameters.

**Immediate topology check**
    For a multi-threaded or asynchronous client, this means waking all
    server monitors for an immediate check.  For a single-threaded client,
    this means a (blocking) scan of all servers.

**Latency window**
    When choosing between several suitable servers, the latency window is the
    range of acceptable RTTs from the shortest RTT to the shortest RTT plus the
    local threshold.  E.g. if the shortest RTT is 15ms and the local threshold
    is 200ms, then the latency window ranges from 15ms - 215ms.

**Local threshold**
    The maximum acceptable difference in milliseconds between the shortest RTT
    and the longest RTT of servers suitable to be selected.

**Mode**
    One of several enumerated values used as part of a read preference, defining
    which server types are candidates for reads and the semantics for choosing a
    specific one.

**Primary**
    Describes a server of type RSPrimary.

**Query**
    An OP_QUERY operation targeting a regular (non '$cmd') collection namespace.

**Read preference**
    The parameters describing which servers in a deployment can receive
    read operations, including ``mode``, ``tag_sets``, and ``maxStalenessSeconds``.

**RS**
    Abbreviation for "replica set".

**RTT**
    Abbreviation for "round trip time".

**Round trip time**
    The time in milliseconds to execute an ``ismaster`` command and
    receive a response for a given server.  This spec differentiates between
    the RTT of a single ``ismaster`` command and a server's *average* RTT over
    several such commands.

**Secondary**
    A server of type RSSecondary.

**Staleness**
    A worst-case estimate of how far a secondary's replication lags behind the primary's last write.

**Server**
    A mongod or mongos process.

**Server selection**
    The process by which a server is chosen for a database operation out of all
    potential servers in a deployment.

**Server type**
    An enumerated type indicating whether a server is up or down, whether it is
    a mongod or mongos, whether it belongs to a replica set and, if so, what
    role it serves in the replica set.  See the `Server Discovery and Monitoring`_
    spec for more details.

**Suitable**
    Describes a server that meets all specified criteria for a read or write
    operation.

**Tag**
    A single key/value pair describing either (1) a user-specified
    characteristic of a replica set member or (2) a desired characteristic for
    the target of a read operation.  The key and value have no semantic meaning
    to the driver; they are arbitrary user choices.

**Tag set**
    A document of zero or more tags.  Each member of a replica set can be
    configured with zero or one tag set.

**Tag set list**
    A list of zero or more tag sets.  A read preference might have a tag set list
    used for selecting servers.

**Topology**
    The state of a deployment, including its type, which servers are
    members, and the server types of members.

**Topology type**
    An enumerated type indicating the semantics for monitoring servers and
    selecting servers for database operations.  See the `Server Discovery and
    Monitoring`_ spec for more details.

Assumptions
-----------

1.  Unless they explicitly override these priorities, we assume our users
    prefer their applications to be, in order:

    - Predictable: the behavior of the application should not change based on
      the deployment type, whether single mongod, replica set or sharded cluster.

    - Resilient: applications will adapt to topology changes, if possible,
      without raising errors or requiring manual reconfiguration.

    - Low-latency: all else being equal, faster responses to queries and writes
      are preferable.

2.  Clients know the state of a deployment based on some form of ongoing
    monitoring, following the rules defined in the `Server Discovery and
    Monitoring`_ spec.

    - They know which members are up or down, what their tag sets are, and
      their types.

    - They know average round trip times to each available member.

    - They detect reconfiguration and the addition or removal of members.

3.  The state of a deployment could change at any time, in between any network
    interaction.

    - Servers might or might not be reachable; they can change type at any
      time, whether due to partitions, elections, or misconfiguration.

    - Data rollbacks could occur at any time.

MongoClient Configuration
-------------------------

Selecting a server requires the following client-level configuration
options:

localThresholdMS
~~~~~~~~~~~~~~~~~~

This defines the size of the latency window for selecting among multiple
suitable servers. The default is 15 (milliseconds).  It MUST be configurable at
the client level.  It MUST NOT be configurable at the level of a database
object, collection object, or at the level of an individual query.

In the prior read preference specification, ``localThresholdMS`` was called
``secondaryAcceptableLatencyMS`` by drivers.  Drivers MUST support the new
name for consistency, but MAY continue to support the legacy name to avoid
a backward-breaking change.

mongos currently uses ``localThreshold`` and MAY continue to do so.
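
As a rough illustration of how ``localThresholdMS`` bounds the latency window,
the following Python sketch filters a list of suitable servers by average RTT
and then picks one at random.  The ``(address, average RTT)`` tuple
representation and both function names are assumptions made for this example;
they are not defined by this spec.

```python
import random


def within_latency_window(servers, local_threshold_ms=15):
    """Return the servers whose average RTT lies inside the latency window.

    `servers` is a list of (address, avg_rtt_ms) tuples, a hypothetical
    representation chosen for this sketch.
    """
    if not servers:
        return []
    shortest = min(rtt for _, rtt in servers)
    # The window runs from the shortest RTT to shortest RTT + local threshold.
    return [(addr, rtt) for addr, rtt in servers
            if rtt <= shortest + local_threshold_ms]


def choose_server(servers, local_threshold_ms=15):
    """Pick one server at random from the latency window, or None."""
    window = within_latency_window(servers, local_threshold_ms)
    return random.choice(window) if window else None
```

With the default threshold of 15 ms and a shortest RTT of 15 ms, a server with
an average RTT of 29 ms stays in the window while one at 31 ms is excluded.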

serverSelectionTimeoutMS
~~~~~~~~~~~~~~~~~~~~~~~~

This defines how long to block for server selection before throwing an
exception.  The default is 30,000 (milliseconds).  It MUST be configurable at
the client level.  It MUST NOT be configurable at the level of a database
object, collection object, or at the level of an individual query.

This default value was chosen to be sufficient for a typical server primary
election to complete.  As the server improves the speed of elections, this
number may be revised downward.

Users that can tolerate long delays for server selection when the topology
is in flux can set this higher.  Users that want to "fail fast" when the
topology is in flux can set this to a small number.

A serverSelectionTimeoutMS of zero MAY have special meaning in some drivers;
zero's meaning is not defined in this spec, but all drivers SHOULD document
the meaning of zero.

serverSelectionTryOnce
~~~~~~~~~~~~~~~~~~~~~~

Single-threaded drivers MUST provide a "serverSelectionTryOnce" mode,
in which the driver scans the topology exactly once after server selection fails,
then either selects a server or raises an error.

The serverSelectionTryOnce option MUST be true by default.
If it is set false, then the driver repeatedly searches for an appropriate server
for up to serverSelectionTimeoutMS milliseconds
(pausing `minHeartbeatFrequencyMS
<https://github.com/mongodb/specifications/blob/master/source/server-discovery-and-monitoring/server-discovery-and-monitoring.rst#minheartbeatfrequencyms>`_
between attempts, as required by the `Server Discovery and Monitoring`_
spec).

Users of single-threaded drivers MUST be able to control this mode in one or
both of these ways:

* In code, pass true or false for an option called serverSelectionTryOnce,
  spelled idiomatically for the language, to the MongoClient constructor.
* Include "serverSelectionTryOnce=true" or "serverSelectionTryOnce=false"
  in the URI. The URI option is spelled the same for all drivers.

Specifying conflicting values through the URI option and the constructor
option is an error.

Multi-threaded drivers MUST NOT provide this mode.
(See `single-threaded server selection implementation`_
and the rationale for a `"try once" mode`_.)

heartbeatFrequencyMS
~~~~~~~~~~~~~~~~~~~~

This controls when topology updates are scheduled.
See `heartbeatFrequencyMS`_ in the `Server Discovery and Monitoring`_ spec for details.

idleWritePeriodMS
~~~~~~~~~~~~~~~~~

A constant: how often an idle primary writes a no-op to the oplog.
See `idleWritePeriodMS`_ in the `Max Staleness`_ spec for details.

smallestMaxStalenessSeconds
~~~~~~~~~~~~~~~~~~~~~~~~~~~

A constant, 90 seconds. See "Smallest allowed value for maxStalenessSeconds"
in the Max Staleness Spec.

Read Preference
---------------

A read preference determines which servers are considered suitable for read
operations.  Read preferences are interpreted differently based on topology
type.  See topology-type-specific server selection rules for details.

When no servers are suitable, the selection might be retried or will eventually
fail following the rules described in the `Rules for server selection`_
section.

Components of a read preference
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

A read preference consists of a ``mode`` and optional
``tag_sets`` and ``maxStalenessSeconds``.  The ``mode`` prioritizes between primaries and
secondaries to produce either a single suitable server or a list of candidate
servers.  If ``tag_sets`` and ``maxStalenessSeconds`` are set, they determine
which candidate servers are eligible for selection.

The default ``mode`` is 'primary'.  The default ``tag_sets``
is a list with an empty tag set: ``[{}]``. The default ``maxStalenessSeconds``
is -1 or null, depending on the language.

Each is explained in greater detail below.

mode
````

For a deployment with topology type ReplicaSetWithPrimary or
ReplicaSetNoPrimary, the ``mode`` parameter controls whether primaries or
secondaries are deemed suitable.  Topology types Single and Sharded have
different selection criteria and are described elsewhere.

Clients MUST support these modes:

**primary**
    Only an available primary is suitable.

**secondary**
    All secondaries (and *only* secondaries) are candidates, but only
    `eligible`_ candidates (i.e. after applying ``tag_sets`` and ``maxStalenessSeconds``) are suitable.

**primaryPreferred**
    If a primary is available, only the primary is suitable.  Otherwise,
    all secondaries are candidates, but only eligible secondaries are suitable.

**secondaryPreferred**
    All secondaries are candidates. If there is at least one eligible
    secondary, only eligible secondaries are suitable.  Otherwise, when there
    are no eligible secondaries, the primary is suitable.

**nearest**
    The primary and all secondaries are candidates, but only eligible
    candidates are suitable.

*Note on other server types*: The `Server Discovery and Monitoring`_ spec defines
several other server types that could appear in a replica set.  Such types are never
candidates, eligible or suitable.
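
The mode rules above can be sketched in Python as follows.  This is a minimal
illustration for replica set topologies only; the function name, its
arguments, and the ``filter_eligible`` callback (standing in for the
``maxStalenessSeconds`` and ``tag_sets`` filters) are hypothetical.

```python
def suitable_by_mode(mode, primary, secondaries,
                     filter_eligible=lambda servers: list(servers)):
    """Return the suitable servers for a replica set topology.

    `primary` is the available primary or None, `secondaries` is a list of
    available secondaries, and `filter_eligible` maps a list of candidates
    to the eligible subset.  All of these names are illustrative.
    """
    primaries = [primary] if primary is not None else []
    if mode == "primary":
        return primaries
    if mode == "secondary":
        return filter_eligible(secondaries)
    if mode == "primaryPreferred":
        # Only fall back to secondaries when no primary is available.
        return primaries if primaries else filter_eligible(secondaries)
    if mode == "secondaryPreferred":
        eligible = filter_eligible(secondaries)
        # Only fall back to the primary when no secondary is eligible.
        return eligible if eligible else primaries
    if mode == "nearest":
        return filter_eligible(primaries + list(secondaries))
    raise ValueError("unknown read preference mode: %r" % (mode,))
```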

.. _algorithm for filtering by staleness:

maxStalenessSeconds
```````````````````

The maximum replication lag, in wall clock time, that a secondary can suffer
and still be eligible.

The default is no maximum staleness.

A ``maxStalenessSeconds`` of -1 MUST mean "no maximum". Drivers are also free to use
None, null, or other representations of "no value" to represent "no max staleness".

Drivers MUST raise an error if ``maxStalenessSeconds`` is a positive number
and the ``mode`` field is 'primary'.

A driver MUST raise an error
if the TopologyType is ReplicaSetWithPrimary or ReplicaSetNoPrimary
and either of these conditions is false::

  maxStalenessSeconds * 1000 >= heartbeatFrequencyMS + idleWritePeriodMS
  maxStalenessSeconds >= smallestMaxStalenessSeconds

``heartbeatFrequencyMS`` is defined in the `Server Discovery and Monitoring`_ spec,
and ``idleWritePeriodMS`` is defined to be 10 seconds in the `Max Staleness`_ spec.

See "Smallest allowed value for maxStalenessSeconds" in the Max Staleness Spec.
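
A minimal Python sketch of the two replica set checks above is shown here.
The function names and the choice to treat ``None`` and -1 as "no maximum"
are assumptions of this example; the constants come from the definitions in
this document.

```python
IDLE_WRITE_PERIOD_MS = 10000          # 10 seconds, per the Max Staleness spec
SMALLEST_MAX_STALENESS_SECONDS = 90   # constant defined in this spec


def max_staleness_violations(max_staleness_seconds, heartbeat_frequency_ms):
    """Return a list describing which conditions are violated (empty if none).

    None and -1 are treated as "no maximum" and are never reported.
    """
    if max_staleness_seconds is None or max_staleness_seconds == -1:
        return []
    violations = []
    if (max_staleness_seconds * 1000
            < heartbeat_frequency_ms + IDLE_WRITE_PERIOD_MS):
        violations.append("less than heartbeatFrequencyMS + idleWritePeriodMS")
    if max_staleness_seconds < SMALLEST_MAX_STALENESS_SECONDS:
        violations.append("less than smallestMaxStalenessSeconds")
    return violations


def validate_max_staleness(max_staleness_seconds, heartbeat_frequency_ms):
    """Raise ValueError if either condition is false."""
    violations = max_staleness_violations(max_staleness_seconds,
                                          heartbeat_frequency_ms)
    if violations:
        raise ValueError("maxStalenessSeconds is " + " and ".join(violations))
```

For example, with a ``heartbeatFrequencyMS`` of 120000, a value of 100 seconds
passes the second check but fails the first, because 100000 is less than
120000 + 10000.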

mongos MUST reject a read with ``maxStalenessSeconds`` provided and a ``mode`` of 'primary'.

mongos MUST reject a read with ``maxStalenessSeconds`` that is not a positive integer.

mongos MUST reject a read if ``maxStalenessSeconds`` is less than smallestMaxStalenessSeconds,
with error code 160 (SERVER-24421).

During server selection,
drivers (but not mongos) MUST raise an error if ``maxStalenessSeconds`` is a positive number,
and any server's ``maxWireVersion`` is less than 5. [#]_

After filtering servers according to ``mode``, and before filtering with ``tag_sets``,
eligibility MUST be determined from ``maxStalenessSeconds`` as follows:

- If ``maxStalenessSeconds`` is not a positive number, then all servers are eligible.

- Otherwise, calculate staleness. Non-secondary servers (including Mongos
  servers) have zero staleness.
  If TopologyType is ReplicaSetWithPrimary,
  a secondary's staleness is calculated using its ServerDescription "S"
  and the primary's ServerDescription "P"::

    (S.lastUpdateTime - S.lastWriteDate) - (P.lastUpdateTime - P.lastWriteDate) + heartbeatFrequencyMS

  (All datetime units are in milliseconds.)

  If TopologyType is ReplicaSetNoPrimary,
  a secondary's staleness is calculated using its ServerDescription "S"
  and the ServerDescription of the secondary with the greatest lastWriteDate,
  "SMax"::

    SMax.lastWriteDate - S.lastWriteDate + heartbeatFrequencyMS

  Servers with staleness less than or equal to ``maxStalenessSeconds``
  (converted to milliseconds for the comparison) are eligible.

See the Max Staleness Spec for overall description and justification of this
feature.
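
The staleness formulas above can be sketched in Python as follows.  The
``Server`` tuple and the function names are hypothetical; its fields mirror
``lastUpdateTime`` and ``lastWriteDate`` from a ServerDescription, both
expressed in milliseconds.

```python
from collections import namedtuple

# Hypothetical, minimal stand-in for a ServerDescription (times in ms).
Server = namedtuple("Server", ["address", "last_update_time", "last_write_date"])


def staleness_with_primary(s, p, heartbeat_frequency_ms):
    """Staleness of secondary `s` when primary `p` is known."""
    return ((s.last_update_time - s.last_write_date)
            - (p.last_update_time - p.last_write_date)
            + heartbeat_frequency_ms)


def staleness_no_primary(s, secondaries, heartbeat_frequency_ms):
    """Staleness of `s` relative to the secondary with the latest write."""
    s_max = max(secondaries, key=lambda x: x.last_write_date)
    return s_max.last_write_date - s.last_write_date + heartbeat_frequency_ms


def eligible_by_staleness(secondaries, primary, max_staleness_seconds,
                          heartbeat_frequency_ms):
    """Return the secondaries that are not too stale."""
    if max_staleness_seconds is None or max_staleness_seconds <= 0:
        return list(secondaries)  # no maximum: every server is eligible
    limit_ms = max_staleness_seconds * 1000
    eligible = []
    for s in secondaries:
        if primary is not None:
            staleness = staleness_with_primary(s, primary, heartbeat_frequency_ms)
        else:
            staleness = staleness_no_primary(s, secondaries, heartbeat_frequency_ms)
        if staleness <= limit_ms:
            eligible.append(s)
    return eligible
```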

.. _algorithm for filtering by tag_sets:

tag_sets
````````

The read preference ``tag_sets`` parameter is an ordered list of tag sets used
to restrict the eligibility of servers, such as for data center awareness.

Clients MUST raise an error if a non-empty tag set is given in ``tag_sets``
and the ``mode`` field is 'primary'.

A read preference tag set (``T``) matches a server tag set (``S``),
or equivalently a server tag set (``S``) matches a read preference
tag set (``T``), if ``T`` is a subset of ``S`` (i.e. ``T ⊆ S``).

For example, the read preference tag set "\{ dc: 'ny', rack: 2 \}" matches a
secondary server with tag set "\{ dc: 'ny', rack: 2, size: 'large' \}".

A tag set that is an empty document matches any server, because the empty
tag set is a subset of any tag set.  This means the default ``tag_sets``
parameter (``[{}]``) matches all servers.

Tag sets are applied after filtering servers by ``mode`` and ``maxStalenessSeconds``,
and before selecting one server within the latency window.

Eligibility MUST be determined from ``tag_sets`` as follows:

- If the ``tag_sets`` list is empty then all candidate servers are eligible
  servers.  (Note, the default of ``[{}]`` means an empty list probably won't
  often be seen, but if the client does not forbid an empty list, this rule
  MUST be implemented to handle that case.)

- If the ``tag_sets`` list is not empty, then tag sets are tried in order until
  a tag set matches at least one candidate server. All candidate servers
  matching that tag set are eligible servers.  Subsequent tag sets in the list
  are ignored.

- If the ``tag_sets`` list is not empty and no tag set in the list matches any
  candidate server, no servers are eligible servers.
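
The three rules above can be sketched in Python as follows.  Representing each
server as a dict with a ``tags`` key is an assumption of this example, as are
the function names.

```python
def tag_set_matches(read_pref_tag_set, server_tag_set):
    """True if the read preference tag set is a subset of the server's tags."""
    return all(key in server_tag_set and server_tag_set[key] == value
               for key, value in read_pref_tag_set.items())


def eligible_by_tag_sets(candidates, tag_sets):
    """Apply an ordered tag set list to the candidate servers.

    `candidates` is a list of dicts with a 'tags' key (hypothetical shape).
    """
    if not tag_sets:
        # An empty list places no restriction on the candidates.
        return list(candidates)
    for tag_set in tag_sets:
        matching = [s for s in candidates
                    if tag_set_matches(tag_set, s["tags"])]
        if matching:
            # The first tag set that matches anything wins; later ones
            # are ignored.
            return matching
    return []
```

Because an empty tag set matches every server, a trailing ``{}`` in the list
acts as a fallback to "any candidate" when no earlier tag set matches.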

Read preference configuration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Drivers MUST allow users to configure a default read preference on a
``MongoClient`` object.  Drivers MAY allow users to configure a default read
preference on a ``Database`` or ``Collection`` object.

A read preference MAY be specified as an object, document or individual
``mode``, ``tag_sets``, and ``maxStalenessSeconds`` parameters,
depending on what is most idiomatic for the language.

If more than one object has a default read preference, the default of the most
specific object takes precedence.  I.e. ``Collection`` is preferred over
``Database``, which is preferred over ``MongoClient``.
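
As a small illustration of this precedence rule, the sketch below resolves the
effective default read preference.  Using ``None`` to mean "not configured at
this level" and representing a read preference as a dict are assumptions of
this example.

```python
def effective_read_preference(client_rp, database_rp=None, collection_rp=None):
    """Return the most specific configured default read preference.

    None means "not configured at this level" (an assumption of this sketch).
    """
    for read_preference in (collection_rp, database_rp, client_rp):
        if read_preference is not None:
            return read_preference
    # Nothing configured anywhere: fall back to the default mode.
    return {"mode": "primary"}
```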

Drivers MAY allow users to set a read preference on queries on a per-operation
basis similar to how ``addSpecial``, ``hint``, or ``batchSize`` are set. E.g.,
in Python::

    db.collection.find({}, read_preference=ReadPreference.SECONDARY)
    db.collection.find(
        {},
        read_preference=ReadPreference.NEAREST,
        tag_sets=[{'dc': 'ny'}],
        maxStalenessSeconds=120)

If a driver API allows users to potentially set both the legacy ``slaveOK``
configuration option and a default read preference configuration option,
passing a value for both MUST be an error. (See `Use of slaveOk`_ for the two
uses of ``slaveOK``.)

Passing read preference to mongos
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If a server of type Mongos is selected for a read operation, the read
preference is passed to the selected mongos through the use of the
``slaveOK`` wire protocol flag, the ``$readPreference`` query
modifier or both, according to the following rules.

If the read preference contains **only** a ``mode`` parameter and the mode is
'primary' or 'secondaryPreferred', for maximum backwards compatibility with
older versions of mongos, drivers MUST only use the value of the ``slaveOK``
wire protocol flag (i.e. set or unset) to indicate the desired read preference
and MUST NOT use a ``$readPreference`` query modifier.

Therefore, when sending queries to a mongos, the following rules apply:

  - For mode 'primary', drivers MUST NOT set the ``slaveOK`` wire protocol flag
    and MUST NOT use ``$readPreference``

  - For mode 'secondary', drivers MUST set the ``slaveOK`` wire protocol flag
    and MUST also use ``$readPreference``

  - For mode 'primaryPreferred', drivers MUST set the ``slaveOK`` wire protocol flag
    and MUST also use ``$readPreference``

  - For mode 'secondaryPreferred', drivers MUST set the ``slaveOK`` wire protocol flag.
    If the read preference contains a non-empty ``tag_sets`` parameter,
    or ``maxStalenessSeconds`` is a positive integer, drivers MUST
    use ``$readPreference``; otherwise, drivers MUST NOT use ``$readPreference``

  - For mode 'nearest', drivers MUST set the ``slaveOK`` wire protocol flag
    and MUST also use ``$readPreference``

The ``$readPreference`` query modifier sends the read preference as part of the
query.  The read preference field ``tag_sets`` is represented in a ``$readPreference``
document using the field name ``tags``.

When any ``$`` modifier is used, including the ``$readPreference`` modifier,
the query MUST be provided using the ``$query`` modifier like so::

    {
        $query: {
            field1: 'query_value',
            field2: 'another_query_value'
        },
        $readPreference: {
            mode: 'secondary',
            tags: [ { 'dc': 'ny' } ],
            maxStalenessSeconds: 120
        }
    }
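
The per-mode rules for mongos can be sketched in Python as follows.  The
function names are hypothetical, and treating a ``tag_sets`` list as
"non-empty" only when it contains at least one non-empty tag set is an
assumption of this example.

```python
VALID_MODES = ("primary", "secondary", "primaryPreferred",
               "secondaryPreferred", "nearest")


def mongos_read_preference(mode, tag_sets=None, max_staleness_seconds=-1):
    """Return (slave_ok, read_pref_doc); read_pref_doc is None when
    $readPreference must not be sent."""
    if mode not in VALID_MODES:
        raise ValueError("unknown read preference mode: %r" % (mode,))
    # Assumption: the default [{}] does not count as a non-empty tag_sets.
    has_tags = any(tag_set for tag_set in (tag_sets or []))
    has_staleness = (max_staleness_seconds is not None
                     and max_staleness_seconds > 0)
    if mode == "primary":
        return False, None
    if mode == "secondaryPreferred" and not has_tags and not has_staleness:
        return True, None
    doc = {"mode": mode}
    if has_tags:
        doc["tags"] = tag_sets  # tag_sets travels under the name "tags"
    if has_staleness:
        doc["maxStalenessSeconds"] = max_staleness_seconds
    return True, doc


def wrap_query(query, read_pref_doc):
    """Wrap the query in $query when a $readPreference modifier is used."""
    if read_pref_doc is None:
        return query
    return {"$query": query, "$readPreference": read_pref_doc}
```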

A valid ``$readPreference`` document for mongos has the following requirements:

1.  The ``mode`` field MUST be present exactly once with the mode represented
    in camel case:

    - 'primary'
    - 'secondary'
    - 'primaryPreferred'
    - 'secondaryPreferred'
    - 'nearest'

2.  If the ``mode`` field is "primary", the ``tags`` and ``maxStalenessSeconds`` fields MUST be absent.

    Otherwise, for other ``mode`` values, the ``tags`` field MUST either be
    absent or be present exactly once and have an array value containing at
    least one document. It MUST contain only documents, no other type.

    The ``maxStalenessSeconds`` field MUST either be absent or be present
    exactly once with an integer value.

Mongos receiving a query with ``$readPreference`` SHOULD validate the
``mode``, ``tags``, and ``maxStalenessSeconds`` fields according to rules 1 and 2 above,
but SHOULD ignore unrecognized fields for
forward-compatibility rather than throwing an error.
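
Rules 1 and 2 can be sketched as a validation routine.  The Python below is an
illustration only: it models the document as a ``dict`` (so "present exactly
once" is implied by the data structure), and the function name and error
strings are hypothetical.

```python
VALID_MODES = ("primary", "secondary", "primaryPreferred",
               "secondaryPreferred", "nearest")


def read_preference_doc_errors(doc):
    """Return a list of rule violations; an empty list means the doc is valid.

    Unrecognized fields are ignored for forward compatibility.
    """
    errors = []
    mode = doc.get("mode")
    if mode not in VALID_MODES:
        errors.append("mode missing or not a recognized camel case value")
        return errors
    if mode == "primary":
        if "tags" in doc or "maxStalenessSeconds" in doc:
            errors.append("tags and maxStalenessSeconds must be absent "
                          "for mode primary")
        return errors
    if "tags" in doc:
        tags = doc["tags"]
        if (not isinstance(tags, list) or not tags
                or not all(isinstance(t, dict) for t in tags)):
            errors.append("tags must be a non-empty array of documents")
    if "maxStalenessSeconds" in doc:
        value = doc["maxStalenessSeconds"]
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            errors.append("maxStalenessSeconds must be an integer")
    return errors
```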

Use of read preferences with commands
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Because some commands are used for writes, deployment-changes or other
state-changing side-effects, the use of read preference by a driver depends on
the command and how it is invoked:

1.  Write commands: ``insert``, ``update``, ``delete``, ``findAndModify``

    Write commands are considered write operations and MUST follow the
    corresponding `Rules for server selection`_ for each topology type.

2.  Generic command method: typically ``command`` or ``runCommand``

    The generic command method MUST act as a read operation for the purposes of
    server selection.

    The generic command method has a default read preference of ``mode``
    'primary'.  The generic command method MUST ignore any default read
    preference from client, database or collection configuration.  The generic
    command method SHOULD allow an optional read preference argument.

    If an explicit read preference argument is provided as part of the generic
    command method call, it MUST be used for server selection, regardless of
    the name of the command. It is up to the user to use an appropriate read
    preference, e.g.  not calling ``renameCollection`` with a ``mode`` of
    'secondary'.

3.  Command-specific helper: methods that wrap database commands, like
    ``count``, ``distinct``, ``listCollections`` or ``renameCollection``.

    Command-specific helpers MUST act as read operations for the purposes of
    server selection, with read preference rules defined by the following three
    categories of commands:

    - "must-use-primary":  these commands have state-modifying effects and will
      only succeed on a primary.  An example is ``renameCollection``.

      These command-specific helpers MUST use a read preference ``mode`` of
      'primary', MUST NOT take a read preference argument and MUST ignore any
      default read preference from client, database or collection
      configuration.  Languages with dynamic argument lists MUST throw an error
      if a read preference is provided as an argument.

      Clients SHOULD rely on the server to return a "not master" or other error
      if the command is "must-use-primary".  Clients MAY raise an exception
      before sending the command if the topology type is Single and the server
      type is not "Standalone", "RSPrimary" or "Mongos", but the identification
      of the set of 'must-use-primary' commands is out of scope for this
      specification.

    - "should-use-primary": these commands are intended to be run on a primary,
      but would succeed -- albeit with possibly stale data -- when run against
      a secondary.  An example is ``listCollections``.

      These command-specific helpers MUST use a read preference ``mode`` of
      'primary', MUST NOT take a read preference argument and MUST ignore any
      default read preference from client, database or collection
      configuration.  Languages with dynamic argument lists MUST throw an error
      if a read preference is provided as an argument.

      Clients MUST NOT raise an exception if the topology type is Single.

    - "may-use-secondary": these commands run against primaries or secondaries,
      according to users' read preferences.  They are sometimes called
      "query-like" commands.

      The current list of "may-use-secondary" commands includes:

        - group
        - mapreduce (with out: {inline: 1})
        - aggregate (without $out specified)
        - collStats, dbStats
        - count, distinct
        - geoNear, geoSearch, geoWalk
        - parallelCollectionScan
        - text (but see caveats under `The 'text' command and mongos`_)

      Associated command-specific helpers SHOULD take a read preference
      argument and otherwise MUST use the default read preference from client,
      database or collection configuration.

      The aggregate command succeeds on a secondary unless $out is specified.
      It is the user's responsibility not to aggregate with $out on a secondary.

      If a client provides a specific helper for inline mapreduce, then it is
      "may-use-secondary" and the *regular* mapreduce helper is
      "must-use-primary". Otherwise mapreduce behaves like the aggregate
      helper: it is the user's responsibility to specify {inline: 1} when
      running mapreduce on a secondary.

    New command-specific helpers implemented in the future will be considered
    "must-use-primary", "should-use-primary" or "may-use-secondary" according
    to the specifications for those future commands.  Command helper
    specifications SHOULD use those terms for clarity.

Rules for server selection
--------------------------

Server selection is a process which takes an operation type (read or write), a
ClusterDescription, and optionally a read preference and, on success, returns a
ServerDescription for an operation of the given type.

Server selection varies depending on whether a client is
multi-threaded/asynchronous or single-threaded because a single-threaded
client cannot rely on the topology state being updated in the background.

Multi-threaded or asynchronous server selection
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

A driver that uses multi-threaded or asynchronous monitoring MUST unblock
waiting operations as soon as server selection completes, even if not all
servers have been checked by a monitor.  Put differently, the client MUST NOT
block server selection while waiting for server discovery to finish.

For example, if the client is discovering a replica set and the application
attempts a read operation with mode 'primaryPreferred', the operation MUST
proceed immediately if a suitable secondary is found, rather than blocking
until the client has checked all members and possibly discovered a primary.

The number of threads allowed to wait for server selection SHOULD be either
(a) the same as the number of threads allowed to wait for a connection from
a pool; or (b) governed by a global or client-wide limit on number of
waiting threads, depending on how resource limits are implemented by a
driver.

For multi-threaded clients, the server selection algorithm is
as follows:

1. Record the server selection start time

2. If the topology wire version is invalid, raise an error

3. Find suitable servers by topology type and operation type

4. If there are any suitable servers, choose one at random from those
   within the latency window and return it; otherwise, continue to step #5

5. Request an immediate topology check, then block the server selection
   thread until the topology changes or until the server selection
   timeout has elapsed

6. If more than ``serverSelectionTimeoutMS`` milliseconds have elapsed since
   the selection start time, raise a `server selection error`_

7. Goto Step #2
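
The steps above can be sketched in Python with a condition variable that
monitors signal whenever the topology changes.  The ``Topology`` class, the
``find_suitable`` callback, and the ``(address, average RTT)`` server tuples
are all hypothetical stand-ins; a real driver would wake its monitors in
``request_immediate_check``.

```python
import random
import threading
import time


class ServerSelectionError(Exception):
    pass


class Topology:
    """Hypothetical stand-in for monitored topology state."""

    def __init__(self):
        self.condition = threading.Condition()
        self.servers = []        # list of (address, avg_rtt_ms) tuples
        self.compatible = True   # False if the wire version is invalid

    def request_immediate_check(self):
        pass  # a real driver would wake all server monitors here

    def update(self, servers):
        """Called by monitors; wakes threads blocked in selection."""
        with self.condition:
            self.servers = servers
            self.condition.notify_all()


def select_server(topology, find_suitable, timeout_ms=30000,
                  local_threshold_ms=15):
    start = time.monotonic()                              # step 1
    deadline = start + timeout_ms / 1000.0
    with topology.condition:
        while True:
            if not topology.compatible:                   # step 2
                raise RuntimeError("incompatible wire version")
            suitable = find_suitable(topology.servers)    # step 3
            if suitable:                                  # step 4
                shortest = min(rtt for _, rtt in suitable)
                window = [s for s in suitable
                          if s[1] <= shortest + local_threshold_ms]
                return random.choice(window)
            topology.request_immediate_check()            # step 5
            remaining = deadline - time.monotonic()
            if remaining > 0:
                topology.condition.wait(remaining)
            if time.monotonic() > deadline:               # step 6
                raise ServerSelectionError("server selection timed out")
            # step 7: loop back to step 2
```

Waiting on the condition releases the lock, so a monitor thread can call
``update`` and unblock selection as soon as a suitable server appears.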

Single-threaded server selection
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Single-threaded drivers do not monitor the topology in the background.
Instead, they MUST periodically update the topology during server selection
as described below.

When ``serverSelectionTryOnce`` is true, ``serverSelectionTimeoutMS`` has
no effect; a single immediate topology check will be done if the topology
starts stale or if the first selection attempt fails.

When ``serverSelectionTryOnce`` is false, then the server selection loops
until a server is successfully selected or until
``serverSelectionTimeoutMS`` is exceeded.

Therefore, for single-threaded clients, the server selection algorithm is
as follows:

1. Record the server selection start time

2. Record the maximum time as start time plus ``serverSelectionTimeoutMS``

3. If the topology has not been scanned in ``heartbeatFrequencyMS``
   milliseconds, mark the topology stale

4. If the topology is stale, proceed as follows:

   - record the target scan time as last scan time plus ``minHeartbeatFrequencyMS``

   - if `serverSelectionTryOnce`_ is false and the target scan time would
     exceed the maximum time, raise a `server selection error`_

   - if the current time is less than the target scan time, sleep until
     the target scan time

   - do a blocking immediate topology check (which must also update the
     last scan time and mark the topology as no longer stale)

5. If the topology wire version is invalid, raise an error

6. Find suitable servers by topology type and operation type

7. If there are any suitable servers, choose one at random from those
   within the latency window and return it; otherwise, mark the topology
   stale and continue to step #8

8. If `serverSelectionTryOnce`_ is true: if the last scan time is newer than
   the selection start time, raise a `server selection error`_; otherwise,
   goto Step #4

9. If `serverSelectionTryOnce`_ is false and the current time exceeds the
   maximum time, raise a `server selection error`_

10. Goto Step #4
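
The single-threaded steps above can be sketched in Python as follows.  This is
an illustration under several assumptions: the topology is a plain dict, the
blocking scan is supplied by the caller as ``scan``, the latency window is
omitted for brevity, the default argument values are placeholders, and steps 8
and 9 are read as the "try once" branch and the timeout branch respectively.

```python
import random
import time


class ServerSelectionError(Exception):
    pass


def select_server_single_threaded(topology, find_suitable, scan,
                                  try_once=True, timeout_ms=30000,
                                  heartbeat_frequency_ms=60000,
                                  min_heartbeat_frequency_ms=500):
    """`topology` is a dict with keys 'last_scan' (monotonic seconds),
    'stale', 'compatible' and 'servers'; `scan(topology)` performs a
    blocking check of all servers.  Both shapes are hypothetical."""
    start = time.monotonic()                                    # step 1
    max_time = start + timeout_ms / 1000.0                      # step 2
    if start - topology["last_scan"] >= heartbeat_frequency_ms / 1000.0:
        topology["stale"] = True                                # step 3
    while True:
        if topology["stale"]:                                   # step 4
            target = (topology["last_scan"]
                      + min_heartbeat_frequency_ms / 1000.0)
            if not try_once and target > max_time:
                raise ServerSelectionError("timed out before next scan")
            delay = target - time.monotonic()
            if delay > 0:
                time.sleep(delay)
            scan(topology)
            topology["last_scan"] = time.monotonic()
            topology["stale"] = False
        if not topology["compatible"]:                          # step 5
            raise RuntimeError("incompatible wire version")
        suitable = find_suitable(topology["servers"])           # step 6
        if suitable:                                            # step 7
            return random.choice(suitable)
        topology["stale"] = True
        if try_once:                                            # step 8
            if topology["last_scan"] > start:
                raise ServerSelectionError("no suitable server after scan")
        elif time.monotonic() > max_time:                       # step 9
            raise ServerSelectionError("server selection timed out")
        # step 10: loop back to step 4
```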

Before using a socket to the selected server, drivers MUST check whether
the socket has been used in `socketCheckIntervalMS
<https://github.com/mongodb/specifications/blob/master/source/server-discovery-and-monitoring/server-discovery-and-monitoring.rst#socketcheckintervalms>`_
milliseconds (as
defined in the `Server Discovery and Monitoring`_ specification).  If the
socket has been idle for longer, the driver MUST update the
ServerDescription for the selected server.  After updating, if the server
is no longer suitable, the driver MUST repeat the server selection
algorithm and select a new server.

Because single-threaded selection can do a blocking immediate check,
``serverSelectionTimeoutMS`` is not a hard deadline.  The actual
maximum server selection time for any given request can vary from
``serverSelectionTimeoutMS`` minus ``minHeartbeatFrequencyMS`` to
``serverSelectionTimeoutMS`` plus the time required for a blocking scan.

Single-threaded drivers MUST document that when ``serverSelectionTryOnce``
is true, selection may take up to the time required for a blocking scan,
and when ``serverSelectionTryOnce`` is false, selection may take up to
``serverSelectionTimeoutMS`` plus the time required for a blocking scan.

Topology type: Unknown
~~~~~~~~~~~~~~~~~~~~~~

When a deployment has topology type "Unknown", no servers are suitable for read or write
operations.

Topology type: Single
~~~~~~~~~~~~~~~~~~~~~

A deployment of topology type Single contains only a single server of any type.
Topology type Single signifies a direct connection intended to receive all read
and write operations.

Therefore, read preference is ignored during server selection with topology
type Single.  The single server is always suitable for reads if it is
available.  Depending on server type, the read preference is communicated
to the server differently:

- Type Mongos: the read preference is sent to the server using the rules
  for `Passing read preference to mongos`_.

- For all other types: clients MUST always set the ``slaveOK`` wire
  protocol flag on reads to ensure that any server type can handle the
  request.

The single server is always suitable for write operations if it is available.

If the server is a secondary, write operations will fail with a "not master"
error from the server; this is by design and is a consequence of using a direct
connection to a secondary.

Topology types: ReplicaSetWithPrimary or ReplicaSetNoPrimary
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

A deployment with topology type ReplicaSetWithPrimary or ReplicaSetNoPrimary
can have a mix of server types: RSPrimary (only in ReplicaSetWithPrimary),
RSSecondary, RSArbiter, RSOther, RSGhost, Unknown or PossiblePrimary.

Read operations
```````````````

For the purpose of selecting a server for read operations, the same rules apply
to both ReplicaSetWithPrimary and ReplicaSetNoPrimary.

To select from the topology a server that matches the user's Read Preference:

If ``mode`` is 'primary', select the primary server.

If ``mode`` is 'secondary' or 'nearest':

  #. Select all secondaries if ``mode`` is 'secondary', or all secondaries and
     the primary if ``mode`` is 'nearest'.
  #. From these, filter out servers staler than ``maxStalenessSeconds`` if it is a positive number.
  #. From the remaining servers, select servers matching the ``tag_sets``.
  #. From these, select one server within the latency window.

(See `algorithm for filtering by staleness`_, `algorithm for filtering by
tag_sets`_, and `selecting servers within the latency window`_ for details
on each step, and `why is maxStalenessSeconds applied before tag_sets?`_.)

If ``mode`` is 'secondaryPreferred', attempt the selection algorithm with
``mode`` 'secondary' and the user's ``maxStalenessSeconds`` and ``tag_sets``. If
no server matches, select the primary.

If ``mode`` is 'primaryPreferred', select the primary if it is known, otherwise
attempt the selection algorithm with ``mode`` 'secondary' and the user's
``maxStalenessSeconds`` and ``tag_sets``.

For all read preference modes except 'primary', clients MUST set the ``slaveOK`` wire
protocol flag to ensure that any suitable server can handle the request.  Clients
MUST NOT set the ``slaveOK`` wire protocol flag if the read preference mode is
'primary'.
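
The read rules above can be sketched in Python as follows. This is a minimal
illustration, not a reference implementation: the ``Server`` class, its
``staleness_seconds`` field (assumed to be precomputed by the staleness
algorithm referenced above), the function names, and the 15 ms default for
``local_threshold_ms`` are all assumptions made for the example.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Server:
    """Hypothetical stand-in for a ServerDescription."""
    address: str
    type: str                        # "RSPrimary", "RSSecondary", ...
    avg_rtt_ms: float
    staleness_seconds: float = 0.0   # assumed to be precomputed elsewhere
    tags: dict = field(default_factory=dict)


def filter_by_tag_sets(servers, tag_sets):
    # Tag sets are tried in order; the first one matching any server wins.
    if not tag_sets:
        return list(servers)
    for tag_set in tag_sets:
        matched = [s for s in servers if tag_set.items() <= s.tags.items()]
        if matched:
            return matched
    return []


def within_latency_window(servers, local_threshold_ms):
    if not servers:
        return []
    fastest = min(s.avg_rtt_ms for s in servers)
    return [s for s in servers if s.avg_rtt_ms <= fastest + local_threshold_ms]


def select_secondary_or_nearest(servers, mode, max_staleness_seconds, tag_sets):
    wanted = {"RSSecondary"} if mode == "secondary" else {"RSSecondary", "RSPrimary"}
    candidates = [s for s in servers if s.type in wanted]
    # Staleness is filtered before tag sets, and only if it is a positive number.
    if max_staleness_seconds is not None and max_staleness_seconds > 0:
        candidates = [s for s in candidates
                      if s.staleness_seconds <= max_staleness_seconds]
    return filter_by_tag_sets(candidates, tag_sets)


def suitable_for_read(servers, mode, max_staleness_seconds=None, tag_sets=None):
    primaries = [s for s in servers if s.type == "RSPrimary"]
    if mode == "primary":
        return primaries
    if mode == "primaryPreferred" and primaries:
        return primaries
    if mode in ("primaryPreferred", "secondaryPreferred"):
        found = select_secondary_or_nearest(
            servers, "secondary", max_staleness_seconds, tag_sets)
        if found or mode == "primaryPreferred":
            return found
        return primaries  # secondaryPreferred falls back to the primary
    return select_secondary_or_nearest(servers, mode, max_staleness_seconds, tag_sets)


def select_for_read(servers, mode, local_threshold_ms=15, **read_pref):
    suitable = suitable_for_read(servers, mode, **read_pref)
    in_window = within_latency_window(suitable, local_threshold_ms)
    return random.choice(in_window) if in_window else None
```

``select_for_read`` returns ``None`` when nothing is suitable; in a driver that
outcome would feed back into the selection loop rather than be returned to the
caller.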

Write operations
````````````````

If the topology type is ReplicaSetWithPrimary, only an available primary is
suitable for write operations.

If the topology type is ReplicaSetNoPrimary, no servers are suitable for write
operations.

Topology type: Sharded
~~~~~~~~~~~~~~~~~~~~~~

A deployment of topology type Sharded contains one or more servers of type
Mongos or Unknown.

For read operations, all servers of type Mongos are suitable; the ``mode``,
``tag_sets``, and ``maxStalenessSeconds`` read preference parameters are ignored for selecting a
server, but are passed through to mongos. See `Passing read preference to mongos`_.

For write operations, all servers of type Mongos are suitable.

If more than one mongos is suitable, drivers MUST randomly select a suitable
server within the latency window.
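
The write rules for the topology types above can be summarized in a short
Python sketch. The tuple shape ``(address, server_type)`` and the function name
are hypothetical, and "available" is taken here to mean "server type is not
Unknown", which is an assumption of this example.

```python
def suitable_for_write(topology_type, servers):
    """Return the servers suitable for a write.

    `servers` is a list of (address, server_type) tuples (illustrative shape).
    """
    if topology_type == "Single":
        # The single server is suitable whenever it is available.
        return [s for s in servers if s[1] != "Unknown"]
    if topology_type == "ReplicaSetWithPrimary":
        return [s for s in servers if s[1] == "RSPrimary"]
    if topology_type == "Sharded":
        return [s for s in servers if s[1] == "Mongos"]
    # Unknown and ReplicaSetNoPrimary: no server is suitable for writes.
    return []
```

When the result holds more than one mongos, the final choice is still a random
pick from those within the latency window.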

Round Trip Times and the Latency Window
---------------------------------------

Calculation of Average Round Trip Times
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

For every available server, clients MUST track the average RTT of server
monitoring ``ismaster`` commands.

An Unknown server has no average RTT.  When a server becomes unavailable, its
average RTT MUST be cleared.  Clients MAY implement this idiomatically (e.g.
nil, -1, etc.).

When there is no average RTT for a server, the average RTT MUST be set equal to
the first RTT measurement (i.e. the first ``ismaster`` command after the
server becomes available).

After the first measurement, average RTT MUST be computed using an
exponentially-weighted moving average formula, with a weighting factor
(``alpha``) of 0.2.  If the prior average is denoted ``old_rtt``, then the new
average (``new_rtt``) is computed from a new RTT measurement (``x``) using the
following formula::

    alpha = 0.2
    new_rtt = alpha * x + (1 - alpha) * old_rtt

A weighting factor of 0.2 was chosen to put about 85% of the weight of the
average RTT on the 9 most recent observations.
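
As a small worked example, the update rule and the weight carried by recent
observations can be checked in Python. The function names are illustrative and
``None`` is used here as the "no average RTT" marker, which is one of the
idiomatic choices the text allows.

```python
ALPHA = 0.2


def update_average_rtt(old_rtt, sample):
    """Return the new average RTT given the prior average and a new sample."""
    if old_rtt is None:
        # No prior average: the first measurement becomes the average.
        return sample
    return ALPHA * sample + (1 - ALPHA) * old_rtt


def weight_on_recent(n, alpha=ALPHA):
    # The k-th most recent sample (k = 0 is the newest) carries weight
    # alpha * (1 - alpha) ** k; summing k = 0 .. n-1 gives 1 - (1 - alpha) ** n.
    return 1 - (1 - alpha) ** n


avg = None
for sample in (10.0, 20.0, 10.0):
    avg = update_average_rtt(avg, sample)
# avg moves 10.0 -> 12.0 -> 11.6 (up to floating-point rounding)
```

With ``alpha = 0.2``, ``weight_on_recent(9)`` evaluates to roughly 0.87, in
line with the approximate figure quoted above.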

Selecting servers within the latency window
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Server selection results in a set of zero or more suitable servers.  If more
than one server is suitable, a server MUST be selected randomly from among
those within the latency window.

The ``localThresholdMS`` configuration parameter controls the size of the
latency window used to select a suitable server.

The shortest average RTT from among suitable servers anchors one end of the
latency window (``A``).  The other end is determined by adding
``localThresholdMS`` (``B = A + localThresholdMS``).

A server MUST be selected randomly from among suitable servers that have an
average RTT (``RTT``) within the latency window (i.e. ``A ≤ RTT ≤ B``).

In other words, the suitable server with the shortest average RTT is **always**
a possible choice.  Other servers could be chosen if their average RTTs are no
more than ``localThresholdMS`` more than the shortest average RTT.
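
A minimal Python sketch of the window computation follows. The mapping from
address to average RTT and the function names are assumptions made for
illustration only.

```python
import random


def servers_in_latency_window(avg_rtts_ms, local_threshold_ms):
    """Return addresses whose average RTT lies in [A, A + localThresholdMS].

    `avg_rtts_ms` maps each suitable server's address to its average RTT (ms).
    """
    if not avg_rtts_ms:
        return []
    a = min(avg_rtts_ms.values())   # shortest average RTT anchors the window
    b = a + local_threshold_ms      # other end of the window
    return sorted(addr for addr, rtt in avg_rtts_ms.items() if a <= rtt <= b)


def select_server(avg_rtts_ms, local_threshold_ms):
    window = servers_in_latency_window(avg_rtts_ms, local_threshold_ms)
    return random.choice(window) if window else None
```

For example, with average RTTs of 5, 18 and 25 ms and a threshold of 15 ms, the
window is 5 to 20 ms, so only the first two servers are candidates. A threshold
of zero leaves only the fastest server, and a very large threshold admits all
of them.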

Requests and Pinning Deprecated
-------------------------------

The prior read preference specification included the concept of a "request",
which pinned a server to a thread for subsequent, related reads.  Requests
and pinning are now **deprecated**.  See `What happened to pinning?`_ for
the rationale for this change.

Drivers with an existing request API MAY continue to provide it for backwards
compatibility, but MUST document that pinning for the request does not
guarantee monotonic reads.

Drivers MUST NOT automatically pin the client or a thread to a particular
server without an explicit ``start_request`` (or comparable) method call.

Outside a legacy "request" API, drivers MUST use server selection for each
individual read operation.

Reference Implementation
========================

The single-threaded reference implementation is the Perl master branch (work
towards v1.0.0).  The multi-threaded reference implementation is TBD.

Implementation Notes
====================

These are suggestions. As always, driver authors should balance cross-language
standardization with backwards compatibility and the idioms of their language.

Modes
-----

Modes ('primary', 'secondary', ...) are constants declared in whatever way is
idiomatic for the programming language. The constant values may be ints,
strings, or whatever.  However, when attaching modes to ``$readPreference``,
camel case must be used as described above in `Passing read preference to
mongos`_.

primaryPreferred and secondaryPreferred
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

'primaryPreferred' is equivalent to selecting a server with read preference mode
'primary' (without ``tag_sets`` or ``maxStalenessSeconds``), or, if that fails, falling back to selecting
with read preference mode 'secondary' (with ``tag_sets`` and ``maxStalenessSeconds``, if provided).

'secondaryPreferred' is the inverse: selecting with mode 'secondary' (with
``tag_sets`` and ``maxStalenessSeconds``) and falling back to selecting with mode 'primary' (without
``tag_sets`` or ``maxStalenessSeconds``).

Depending on the implementation, this may result in cleaner code.

nearest
~~~~~~~

The term 'nearest' is unfortunate, as it implies a choice based on geographic
locality or absolute lowest latency, neither of which is true.

Unlike the other read preference modes, 'nearest' does not favor
either primaries or secondaries; instead, all servers are candidates and are
filtered by ``tag_sets`` and ``maxStalenessSeconds``.

To always select the server with the lowest RTT, users should use mode 'nearest'
without ``tag_sets`` or ``maxStalenessSeconds`` and set ``localThresholdMS`` to zero.

To distribute reads across all members evenly regardless of RTT, users should
use mode 'nearest' without ``tag_sets`` or ``maxStalenessSeconds`` and set ``localThresholdMS`` very high so
that all servers fall within the latency window.

In both cases, ``tag_sets`` and ``maxStalenessSeconds`` could be used to further restrict the set of eligible
servers, if desired.

Tag set lists
-------------

Tag set lists can be configured in the driver in whatever way is natural for
the language.

Multi-threaded server selection implementation
----------------------------------------------

The following example uses a single lock for clarity.  Drivers are free to
implement whatever concurrency model best suits their design.

Pseudocode for `multi-threaded or asynchronous server selection`_::

    def getServer(criteria):
        client.lock.acquire()

        now = gettime()
        endTime = now + serverSelectionTimeoutMS

        while true:
            # The topologyDescription keeps track of whether any server has
            # an invalid wire version range
            if not topologyDescription.compatible:
                client.lock.release()
                throw invalid wire protocol range error with details

            if maxStalenessSeconds is set:
                if any server's maxWireVersion < 5:
                    client.lock.release()
                    throw error

                if topologyDescription.type in (ReplicaSetWithPrimary, ReplicaSetNoPrimary):
                    if (maxStalenessSeconds * 1000 < heartbeatFrequencyMS + idleWritePeriodMS or
                        maxStalenessSeconds < smallestMaxStalenessSeconds):
                        client.lock.release()
                        throw error

            servers = all servers in topologyDescription matching criteria

            if servers is not empty:
                in_window = servers within the latency window
                selected = random entry from in_window
                client.lock.release()
                return selected

            request that all monitors check immediately

            # Wait for a new TopologyDescription. condition.wait() releases
            # client.lock while waiting and reacquires it before returning.
            # While a thread is waiting on client.condition, it is awakened
            # early whenever a server check completes.
            timeout_left = endTime - gettime()
            client.condition.wait(timeout_left)

            # Refresh "now"; the value recorded before the loop is out of date
            now = gettime()
            if now after endTime:
                client.lock.release()
                throw server selection error
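
For readers who prefer runnable code, the waiting pattern above can be sketched
with Python's ``threading.Condition``. This is only an illustration under
several assumptions: the ``Client`` class, its ``servers`` mapping of address to
average RTT, and the ``is_suitable`` callback are hypothetical, and the wire
version and ``maxStalenessSeconds`` checks are omitted for brevity. Unlike the
pseudocode, this sketch re-examines the topology once more after a timed-out
wait before raising.

```python
import random
import threading
import time


class ServerSelectionError(Exception):
    pass


class Client:
    """Hypothetical client; `servers` maps address -> average RTT in ms."""

    def __init__(self, server_selection_timeout_ms=30000, local_threshold_ms=15):
        self.lock = threading.Lock()
        self.condition = threading.Condition(self.lock)
        self.servers = {}
        self.timeout_s = server_selection_timeout_ms / 1000
        self.local_threshold_ms = local_threshold_ms

    def on_server_check(self, address, avg_rtt_ms):
        # Called by a monitor thread after each check; wakes all waiters.
        with self.condition:
            self.servers[address] = avg_rtt_ms
            self.condition.notify_all()

    def request_immediate_checks(self):
        pass  # placeholder: a real driver would wake its monitors here

    def get_server(self, is_suitable):
        end_time = time.monotonic() + self.timeout_s
        with self.condition:  # acquires the lock; released on return or raise
            while True:
                rtts = {a: r for a, r in self.servers.items()
                        if r is not None and is_suitable(a)}
                if rtts:
                    limit = min(rtts.values()) + self.local_threshold_ms
                    return random.choice(
                        [a for a, r in rtts.items() if r <= limit])
                self.request_immediate_checks()
                timeout_left = end_time - time.monotonic()
                if timeout_left <= 0:
                    raise ServerSelectionError(
                        "no suitable server before timeout")
                # wait() releases the lock while blocked, reacquires it after
                self.condition.wait(timeout_left)
```

Using the ``with`` statement guarantees the lock is released on every exit
path, which replaces the explicit ``client.lock.release()`` calls in the
pseudocode.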


Single-threaded server selection implementation
-----------------------------------------------

Pseudocode for `single-threaded server selection`_::

    def getServer(criteria):
        startTime = gettime()
        loopEndTime = startTime
        maxTime = startTime + serverSelectionTimeoutMS/1000
        nextUpdateTime = topologyDescription.lastUpdateTime
                       + heartbeatFrequencyMS/1000

        if nextUpdateTime < startTime:
            topologyDescription.stale = true

        while true:

            if topologyDescription.stale:
                scanReadyTime = topologyDescription.lastUpdateTime
                              + minHeartbeatFrequencyMS/1000

                if ((not serverSelectionTryOnce) and (scanReadyTime > maxTime)):
                    throw server selection error with details

                # using loopEndTime below is a proxy for "now" but avoids
                # the overhead of another gettime() call
                sleepTime = scanReadyTime - loopEndTime

                if sleepTime > 0:
                    sleep sleepTime

                rescan all servers
                topologyDescription.lastUpdateTime = gettime()
                topologyDescription.stale = false

            # topologyDescription keeps a record of whether any
            # server has an incompatible wire version range
            if not topologyDescription.compatible:
                topologyDescription.stale = true
                throw invalid wire version range error with details

            if maxStalenessSeconds is set:
                if any server's maxWireVersion < 5:
                    throw error

                if topologyDescription.type in (ReplicaSetWithPrimary, ReplicaSetNoPrimary):
                    if (maxStalenessSeconds * 1000 < heartbeatFrequencyMS + idleWritePeriodMS or
                        maxStalenessSeconds < smallestMaxStalenessSeconds):
                        throw error

            servers = all servers in topologyDescription matching criteria

            if servers is not empty:
                in_window = servers within the latency window
                return random entry from in_window
            else:
                topologyDescription.stale = true

            loopEndTime = gettime()

            if serverSelectionTryOnce:
                if topologyDescription.lastUpdateTime > startTime:
                    throw server selection error with details
            else if loopEndTime > maxTime:
                throw server selection error with details
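
The loop above can also be expressed as a runnable Python sketch. Everything
named here is hypothetical: the ``Topology`` class, the ``scan`` callable (which
stands in for a blocking rescan and returns a mapping of address to average
RTT), and the injectable ``clock`` and ``sleep`` parameters, which exist only so
the timing can be exercised without real delays. The wire version and
``maxStalenessSeconds`` checks are left out to keep the example short.

```python
import random
import time


class ServerSelectionError(Exception):
    pass


class Topology:
    """Hypothetical single-threaded topology; all times are in seconds."""

    def __init__(self, scan, heartbeat_frequency_ms=60000,
                 min_heartbeat_frequency_ms=500):
        self.scan = scan                # callable returning {address: avg_rtt_ms}
        self.servers = {}
        self.last_update_time = float("-inf")   # never scanned
        self.stale = True
        self.heartbeat_s = heartbeat_frequency_ms / 1000
        self.min_heartbeat_s = min_heartbeat_frequency_ms / 1000


def get_server(topology, is_suitable, try_once=True, timeout_ms=30000,
               local_threshold_ms=15, clock=time.monotonic, sleep=time.sleep):
    start_time = loop_end_time = clock()
    max_time = start_time + timeout_ms / 1000
    if topology.last_update_time + topology.heartbeat_s < start_time:
        topology.stale = True

    while True:
        if topology.stale:
            scan_ready_time = topology.last_update_time + topology.min_heartbeat_s
            if not try_once and scan_ready_time > max_time:
                raise ServerSelectionError("next scan would exceed the deadline")
            if scan_ready_time > loop_end_time:
                sleep(scan_ready_time - loop_end_time)
            topology.servers = topology.scan()      # blocking rescan
            topology.last_update_time = clock()
            topology.stale = False

        rtts = {a: r for a, r in topology.servers.items() if is_suitable(a)}
        if rtts:
            limit = min(rtts.values()) + local_threshold_ms
            return random.choice([a for a, r in rtts.items() if r <= limit])
        topology.stale = True

        loop_end_time = clock()
        if try_once:
            if topology.last_update_time > start_time:
                raise ServerSelectionError("no suitable server after one scan")
        elif loop_end_time > max_time:
            raise ServerSelectionError("no suitable server before timeout")
```

With ``try_once=True`` the function performs at most one scan after the call
begins, provided the clock advances between readings. With ``try_once=False``
it keeps rescanning, spaced by the minimum heartbeat interval, until the
deadline would be exceeded.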

.. _server selection error:

Server Selection Errors
-----------------------

Drivers should use server descriptions and their error attributes (if set) to
return useful error messages.

For example, when there are no members matching the ReadPreference:

- "No server available for query with ReadPreference primary"
- "No server available for query with ReadPreference secondary"
- "No server available for query with ReadPreference " + mode + ", tag set list " + tag_sets + ", and ``maxStalenessSeconds`` " + maxStalenessSeconds

Or, if authentication failed:

- "Authentication failed: [specific error message]"

Here is a sketch of some pseudocode for handling error reporting when errors
could be different across servers::

    if there are any available servers:
        error_message = "No servers are suitable for " + criteria
    else if all ServerDescriptions' errors are the same:
        error_message = a ServerDescription.error value
    else:
        error_message = ', '.join(all ServerDescriptions' errors)
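
A runnable Python rendering of this sketch might look like the following. It
assumes errors are available as plain strings, and the function name, the
argument shapes, and the fallback message for an empty error list are
illustrative additions rather than part of the specification.

```python
def selection_error_message(server_errors, any_available, criteria):
    """Build a server selection error message.

    `server_errors` is a list of error strings taken from ServerDescriptions.
    """
    if any_available:
        return "No servers are suitable for " + criteria
    if not server_errors:
        # Not covered by the sketch above; a generic fallback for this example.
        return "No servers are available"
    if len(set(server_errors)) == 1:
        return server_errors[0]
    return ", ".join(server_errors)
```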

Use of slaveOk
--------------

There are two usages of ``slaveOk``:

1.  A driver query parameter that predated read preference modes and tag
    set lists.

2.  A wire protocol flag on OP_QUERY operations


Using ``slaveOk`` as a query parameter is deprecated. Until it is removed,
``slaveOk`` used as a method argument or query option is considered
equivalent to a read preference ``mode`` of 'secondaryPreferred'.

The ``slaveOk`` wire protocol flag remains in the wire protocol and drivers
set this bit for each topology type as described in the specification
above.
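
As an illustration, the per-topology rules for the wire protocol flag on
servers other than mongos can be sketched in Python. The function names are
hypothetical. The bit position used (bit 2 of the OP_QUERY ``flags`` field) is
stated here from the wire protocol definition rather than from this document.
Servers of type Mongos are deliberately excluded, since they follow the rules
for passing read preference to mongos.

```python
SLAVE_OK = 1 << 2  # SlaveOk is bit 2 of the OP_QUERY flags field


def slave_ok_required(topology_type, server_type, mode):
    """Decide whether the slaveOk bit must be set for a read (non-mongos only)."""
    if server_type == "Mongos":
        raise ValueError("use the rules for passing read preference to mongos")
    if topology_type == "Single":
        # Direct connection: always set so that any server type can answer.
        return True
    # Replica set topologies: set for every mode except 'primary'.
    return mode != "primary"


def apply_slave_ok(flags, required):
    """Return `flags` with the slaveOk bit set or cleared."""
    return flags | SLAVE_OK if required else flags & ~SLAVE_OK
```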

Cursors
-------

Cursor operations OP_GET_MORE and OP_KILL_CURSORS do not go through the server
selection process.  Cursor operations must be sent to the original server that
received the query and sent the OP_REPLY.  For exhaust cursors, the same socket
must be used for OP_GET_MORE until the cursor is exhausted.

The 'text' command and mongos
-----------------------------

*Note*: As of MongoDB 2.6, mongos doesn't distribute the "text" command to
secondaries; see SERVER-10947_.

However, the "text" command is deprecated in 2.6, so this command-specific
helper may become deprecated before this is fixed.

.. _SERVER-10947: https://jira.mongodb.org/browse/SERVER-10947

Test Plan
=========

The server selection test plan is given in a separate document that
describes the tests and supporting data files: `Server Selection Tests`_

.. _Server Selection Tests: https://github.com/mongodb/specifications/blob/master/source/server-selection/server-selection-tests.rst

Design Rationale
================

Use of topology types
---------------------

The prior version of the read preference spec had only a loose definition of
server or topology types.  The `Server Discovery and Monitoring`_ spec defines these terms
explicitly and they are used here for consistency and clarity.

Consistency with mongos
-----------------------

In order to ensure that behavior is consistent regardless of topology type,
read preference behaviors are limited to those that mongos can proxy.

For example, mongos ignores read preference 'secondary' when a shard consists of
a single server.  Therefore, this spec calls for topology type Single to ignore
read preferences for consistency.

The spec has been written with the intention that it can apply to both drivers
and mongos and the term "client" has been used when behaviors should apply to
both.  Behaviors that are specific to drivers are largely limited to those
for communicating with a mongos.

New localThresholdMS configuration option name
------------------------------------------------

Because this does not apply **only** to secondaries and does not limit absolute
latency, the name ``secondaryAcceptableLatencyMS`` is misleading.

The mongos name ``localThreshold`` misleads because it has nothing to do with
locality.  It also doesn't include the ``MS`` units suffix for consistency with
other time-related configuration options.

However, given a choice between the two, ``localThreshold`` is a more general
term.  For drivers, we add the ``MS`` suffix for clarity about units and
consistency with other configuration options.

Random selection within the latency window
------------------------------------------

When more than one server is judged to be suitable, the spec calls for random
selection to ensure a fair distribution of work among servers within the
latency window.

It would be hard to ensure a fair round-robin approach given the potential for
servers to come and go.  Making newly available servers either first or last
could lead to unbalanced work.  Random selection has a better fairness
guarantee and keeps the design simpler.

The slaveOK wire protocol flag
------------------------------

In server selection, there is a race condition that could exist between what
a selected server type is believed to be and what it actually is.

The ``slaveOK`` wire protocol flag solves the race problem by communicating
to the server whether a secondary is acceptable.  The server knows its type
and can return a "not master" error if ``slaveOK`` is false and the server
is a secondary.

However, because topology type Single is used for direct connections, we want
read operations to succeed even against a secondary, so the ``slaveOK`` wire
protocol flag must be sent to mongods with topology type Single.

(If the server type is Mongos, follow the rules for `passing read preference to
mongos`_, even for topology type Single.)

General command method going to primary
---------------------------------------

The list of commands that can go to secondaries changes over time and depends
not just on the command but on parameters.  For example, the ``mapReduce``
command may or may not be able to be run on secondaries depending on the value
of the ``out`` parameter.

It significantly simplifies implementation for the general command method
always to go to the primary unless an explicit read preference is set and rely
on users of the general command method to provide a read preference appropriate
to the command.

The command-specific helpers will need to implement a check of read preferences
against the semantics of the command and its parameters, but keeping this logic
close to the command rather than in a generic method is a better design than
either delegating this check to the generic method, duplicating the logic in
the generic method, or coupling both to another validation method.

Average round trip time calculation
-----------------------------------

Using an exponentially-weighted moving average avoids having to store and
rotate an arbitrary number of RTT observations.  All observations count towards
the average.  The weighting makes recent observations count more heavily while
smoothing volatility.

Verbose errors
--------------

Error messages should be sufficiently verbose to allow users and/or support
engineers to determine the reasons for server selection failures from log
or other error messages.

"Try once" mode
---------------

Single-threaded drivers in languages like PHP and Perl are typically deployed
as many processes per application server. Each process must independently
discover and monitor the MongoDB deployment.

When no suitable server is available (due to a partition or misconfiguration),
it is better for each request to fail as soon as its process detects a
problem, instead of waiting and retrying to see if the deployment recovers.

Minimizing response latency is important for maximizing request-handling
capacity and for user experience (e.g. a quick fail message instead of a slow
web page).

However, when a request arrives and the topology information is already stale,
or no suitable server is known,
making a single attempt to update the topology to service the request is
acceptable.

A user of a single-threaded driver who prefers resilience in the face of topology problems,
rather than short response times,
can turn the "try once" mode off.
Then the driver rescans the topology every ``minHeartbeatFrequencyMS``
until a suitable server is found or ``serverSelectionTimeoutMS`` expires.

Backwards Compatibility
=======================

In general, backwards breaking changes have been made in the name of
consistency with mongos and avoiding misleading users about monotonicity.

* Features removed:

    - Automatic pinning (see `What happened to pinning?`_)

    - Auto retry (replaced by the general server selection algorithm)

    - mongos "high availability" mode (effectively, mongos pinning)

* Other features and behaviors have changed explicitly

    - Ignoring read preferences for topology type Single

    - Default read preference for the generic command method

* Changes with grandfather clauses

    - Alternate names for ``localThresholdMS``

    - Pinning for legacy request APIs

* Internal changes with little user-visibility

    - Clarifying calculation of average RTT

Questions and Answers
=====================

What happened to pinning?
-------------------------

The prior read preference spec, which was implemented in the versions of the
drivers and mongos released concomitantly with MongoDB 2.2, stated that a
thread / client should remain pinned to an RS member as long as that member
matched the current mode, tags, and acceptable latency. This increased the
odds that reads would be monotonic (assuming no rollback),
but had the following surprising consequence:

1. Thread / client reads with mode 'secondary' or 'secondaryPreferred', gets
   pinned to a secondary
2. Thread / client reads with mode 'primaryPreferred', driver / mongos sees that
   the pinned member (a secondary) matches the mode (which *allows* for a
   secondary) and reads from secondary, even though the primary is available and
   preferable

The old spec also had the swapped problem, reading from the primary with
'secondaryPreferred', except for mongos which was changed at the last minute
before release with SERVER-6565_ ("Do not use primary if secondaries are
available for slaveOk").

This left application developers with two problems:

1. 'primaryPreferred' and 'secondaryPreferred' acted surprisingly and
   unpredictably within requests
2. There was no way to specify a common need: read from a secondary if possible
   with 'secondaryPreferred', then from primary if possible with 'primaryPreferred',
   all within a request. Instead an application developer would have to do the
   second read with 'primary', which would unpin the thread but risk unavailability
   if only secondaries were up.

Additionally, mongos 2.4 introduced the releaseConnectionsAfterResponse option
(RCAR), mongos 2.6 made it the default and mongos 2.8 will remove the ability
to turn it off.  This means that pinning to a mongos offers no guarantee that
connections to shards are pinned.  Since we can't provide the same guarantees
for replica sets and sharded clusters, we removed automatic pinning entirely
and deprecated "requests". See SERVER-11956_ and SERVER-12273_.

Regardless, even for replica sets, pinning offers no monotonicity because of
the ever-present possibility of rollbacks.  Through MongoDB 2.6, secondaries
did not close sockets on rollback, so a rollback could happen between any two
queries without any indication to the driver.

Therefore, an inconsistent feature that doesn't actually do what people think
it does has no place in the spec and has been removed.  Should the server
eventually implement some form of "sessions", this spec will need to be revised
accordingly.

.. _SERVER-6565: https://jira.mongodb.org/browse/SERVER-6565
.. _SERVER-11956: https://jira.mongodb.org/browse/SERVER-11956
.. _SERVER-12273: https://jira.mongodb.org/browse/SERVER-12273

Why change from mongos High Availability (HA) to random selection?
---------------------------------------------------------------------

Mongos HA has problems similar to those of pinning: one can wind up pinned
to a high-latency mongos even if a lower-latency mongos later becomes
available.

Random selection within the latency window avoids this problem and makes server
selection exactly analogous to having multiple suitable servers from a replica
set.  This is easier to explain and implement.

What happened to auto-retry?
----------------------------

The old auto-retry mechanism was closely connected to server pinning, which has
been removed.  It also mandated exactly three attempts to carry out a query on
different servers, with no way to disable or adjust that value, and only for
the first query within a request.

To the extent that auto-retry was trying to compensate for unavailable servers,
the Server Discovery and Monitoring spec and new server selection algorithm
provide a more robust and configurable way to direct *all* queries to available
servers.

After a server is selected, several error conditions could still occur that
make the selected server unsuitable for sending the operation, such as:

    - the server could have shut down the socket (e.g. a primary stepping down),

    - a connection pool could be empty, requiring new connections; those
      connections could fail to connect or could fail the server handshake

Once an operation is sent over the wire, several additional error conditions
could occur, such as:

    - a socket timeout could occur before the server responds

    - the server might send an RST packet, indicating the socket was already closed

    - for write operations, the server might return a "not master" error

This specification neither requires nor prohibits automatic recovery by
drivers in cases where it might be considered reasonable to attempt it,
such as:

    - repeating server selection if, after selection, a socket is determined to
      be unsuitable before a message is sent on it

    - for a read operation, after a socket error, selecting a new server
      meeting the read preference and resending the query

    - for a write operation, after a "not master" error, selecting a new server
      (to locate the primary) and resending the write operation

Driver-common rules for retrying operations (and configuring such retries)
could be the topic of a different, future specification.

Why is maxStalenessSeconds applied before tag_sets?
---------------------------------------------------

The intention of read preference's list of tag sets is to allow a user to prefer
the first tag set but fall back to members matching later tag sets. In order to
know whether to fall back or not, we must first filter by all other criteria.

Say you have two secondaries:

  - Node 1, tagged ``{'tag': 'value1'}``, estimated staleness 5 minutes
  - Node 2, tagged ``{'tag': 'value2'}``, estimated staleness 1 minute

And a read preference:

  - mode: "secondary"
  - maxStalenessSeconds: 120 (2 minutes)
  - tag_sets: ``[{'tag': 'value1'}, {'tag': 'value2'}]``

If tag sets were applied before maxStalenessSeconds, we would select Node 1 since it
matches the first tag set, then filter it out because it is too stale, and be
left with no eligible servers.

The user's intent in specifying two tag sets was to fall back to the second set
if needed, so we filter by maxStalenessSeconds first, then tag_sets, and select
Node 2.
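
The two orderings can be compared directly with a short Python sketch of this
example. The dictionary layout and helper names are illustrative, and staleness
is given as a precomputed number of seconds.

```python
nodes = [
    {"name": "Node 1", "tags": {"tag": "value1"}, "staleness_s": 300},
    {"name": "Node 2", "tags": {"tag": "value2"}, "staleness_s": 60},
]
MAX_STALENESS_SECONDS = 120
TAG_SETS = [{"tag": "value1"}, {"tag": "value2"}]


def by_staleness(servers):
    return [n for n in servers if n["staleness_s"] <= MAX_STALENESS_SECONDS]


def by_tag_sets(servers):
    # The first tag set that matches any server determines the result.
    for tag_set in TAG_SETS:
        matched = [n for n in servers if tag_set.items() <= n["tags"].items()]
        if matched:
            return matched
    return []


# Order used by this spec: staleness first, then tag sets.
staleness_first = by_tag_sets(by_staleness(nodes))

# Rejected order: tag sets first, then staleness.
tags_first = by_staleness(by_tag_sets(nodes))
```

Here ``staleness_first`` contains only Node 2, while ``tags_first`` is empty
because Node 1 wins the tag match and is then discarded as too stale.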

References
==========

- `Server Discovery and Monitoring`_ specification
- `Driver Authentication`_ specification

.. _Server Discovery and Monitoring: https://github.com/mongodb/specifications/tree/master/source/server-discovery-and-monitoring
.. _heartbeatFrequencyMS: https://github.com/mongodb/specifications/blob/master/source/server-discovery-and-monitoring/server-discovery-and-monitoring.rst#heartbeatfrequencyms
.. _Max Staleness: https://github.com/mongodb/specifications/tree/master/source/max-staleness
.. _idleWritePeriodMS: https://github.com/mongodb/specifications/tree/master/source/max-staleness.rst#idleWritePeriodMS
.. _Driver Authentication: https://github.com/mongodb/specifications/blob/master/source/auth

Changes
=======

2015-06-26: Updated single-threaded selection logic with "stale" and serverSelectionTryOnce.

2015-08-10: Updated single-threaded selection logic to ensure a scan always
happens at least once under serverSelectionTryOnce if selection fails.
Removed the general selection algorithm and put full algorithms for each of
the single- and multi-threaded sections. Added a requirement that
single-threaded drivers document selection time expectations.

2016-07-21: Updated for Max Staleness support.

2016-08-03: Clarify selection algorithm, in particular that maxStalenessMS
comes before tag_sets.

2016-10-24: Rename option from "maxStalenessMS" to "maxStalenessSeconds".

2016-10-25: Change minimum maxStalenessSeconds value from 2 * heartbeatFrequencyMS
to heartbeatFrequencyMS + idleWritePeriodMS (with proper conversions of course).

2016-11-01: Update formula for secondary staleness estimate with the
equivalent, and clearer, expression of this formula from the Max Staleness Spec

2016-11-21: Revert changes that would allow idleWritePeriodMS to change in the
future, require maxStalenessSeconds to be at least 90.

.. [#] mongos 3.4 refuses to connect to mongods with maxWireVersion < 5,
   so it does no additional wire version checks related to maxStalenessSeconds.