Master Leases for Berkeley DB
Susan LoVerso
sue@sleepycat.com
Rev 1.1
2007 Feb 2
What are Master Leases?
A master lease is a mechanism whereby clients grant master-ship rights
to a site, and that master, by holding those leases, can provide a
guarantee of durability to a replication group for a given period of
time. By granting a lease to a master,
a client agrees not to participate in an election for a new
master until the granted lease has expired. By holding a
collection of granted leases, a master can serve
authoritative read requests from applications. While the master holds
leases, a read operation on the master can guarantee several things to the
application:
- Authoritative reads: a guarantee that the data being read by the
application is durable and can never be rolled back.
- Freshness: a guarantee that the data being read by the
application at the master is
not stale.
- Master viability: a guarantee that a current master with valid
leases will not encounter a duplicate master situation.
Requirements
The requirements of DB to support this include:
- After leases are turned on, users can choose on each read whether
to abide by them or ignore them.
- We are providing read authority on the master only. A
read on a client is equivalent to a read while ignoring leases.
- We guarantee that data committed on a master that has been
read by an application on the
master will not be rolled back. Data read on a client or
while ignoring leases or data
successfully updated/committed but not read,
may be rolled back.
- A master that is not ignoring leases will return successfully from
a read operation only if it holds a
majority of leases.
- Master leases will remove the possibility of a current/correct
master being "shot down" by DUPMASTER. NOTE: Old/expired
masters may still discover a
later master and return DUPMASTER to the application.
- Any send callback failure must result in premature lease
expiration on the master.
- Users who change the system clock during master leases void the
guarantee and may get undefined behavior. We assume time always
runs forward.
- Clients are forbidden from participating in elections while they
have an outstanding lease granted to another site.
- Clients are forbidden from accepting a new master while they have
an outstanding lease granted to another site.
- Clients are forbidden from upgrading themselves to master while
they have an outstanding lease granted to another site.
- When asked for a lease grant explicitly by the master, the client
cannot grant the lease to the master unless the LSN in the master's
request has been processed by this client.
The requirements of the
application using leases include:
- Users must implement (Base API users on their own, RepMgr users
via configuration) a majority (or larger) ACK policy.
- The application must use the election mechanism to decide a master.
It may not simply declare a site master.
- The send callback must return an error if the majority ACK policy
is not met for PERM records.
- Users must set the number of sites in the group.
- Using leases in a replication group is all-or-none.
Therefore, if a site knows it is using leases, it can assume other
sites are also.
- All applications that care about read guarantees must forward or
perform all reads on the master. Reading on the client means a
read ignoring leases.
There are some open questions
remaining.
- There is one major showstopper issue, see Crashing - Potential
problem near the end of the document. We need a better solution
than the one shown there (writing to disk every time a lease is
granted). Perhaps just documenting that durability means it must be
flushed to disk before success to avoid that situation?
- What about db->join? Users can call join, but the calls
on the join cursor to get the data would be subject to leases and
therefore protected. Ok, this is not an open question.
- What about other read-like operations? Clearly
DB->get, DB->pget, DBC->get,
DBC->pget need lease checks. However, other APIs use
keys. DB->key_range
provides an estimate only so it shouldn't need lease checks.
DB->stat provides exact counts
in the bt_nkeys and bt_ndata fields. Are those
fields considered authoritative, such that providing those values implies a
durability guarantee and DB->stat should therefore
be subject to lease verification? DBC->count
provides a count of
the number of data items associated with a key. Is this
authoritative information? This is similar to stat: should it be
subject to lease verification?
- Do we require master lease checks on write operations? I
think lease checks are not needed on write operations. It doesn't
add correctness and adds a lot of complexity (checking leases in put,
del, and cursors, then what about rename, remove, etc).
- Do master leases give an iron-clad guarantee of never rolling
back a transaction? No, but it should mean that a committed transaction
can never be read on a master
unless the lease is valid. A committed transaction on a master
that has never been presented to the application may get rolled back.
- Do we need to quarantine or prevent reads on an ex-master until
sync-up is done? No. A master that is simply downgraded to
client or crashes and reboots is now a client. Reading from that
client is the same as saying Ignore Leases.
- What about adding and removing sites while leases are
active? This is SR 14778. A consistent nsites value
is required by master
leases. It isn't
clear to me what a master is
supposed to do if the value of nsites gets smaller while leases are
active. Perhaps it leaves its larger table intact and simply
checks for a smaller number of granted leases?
- Can users turn leases off? No. There is no planned API to
turn leases off.
- Clock skew will be a percentage. However, the smallest, 1%,
is probably rather large for clock skew. Percentage was chosen
for simplicity and similarity to other APIs. What granularity is
appropriate here?
API Changes
The API changes that are visible
to the user are fairly minimal.
There are a few API calls they need to make to configure master leases
and then there is the API call to turn them on. There is also a
new flag to existing APIs that allows read operations to ignore leases and
return data that
may not be durable.
Lease Timeout
There is a new timeout the user
must configure for leases called DB_REP_LEASE_TIMEOUT.
This timeout will be a new option to
the dbenv->rep_set_timeout method. The DB_REP_LEASE_TIMEOUT
has no default, and the user is required to configure it
before turning on leases (obviously, this timeout need not be set if
leases will not be used). The lease timeout is the amount of time
the lease is valid on the master and how long it is granted
on the client. It must be the same
value on all sites (like log file size). The timeout used when
refreshing leases is the DB_REP_ACK_TIMEOUT
for RepMgr applications. For Base API applications, lease
refreshes will use the same mechanism as PERM messages, so they
should
impose no additional burden. The refresh timeout
is the amount of time a reader will wait to refresh
leases before returning failure to the application from a read
operation.
This timeout will be both stored
with its original value, and also
converted to a db_timespec
using the DB_TIMEOUT_TO_TIMESPEC
macro and have the clock skew accounted for and stored in the shared
rep structure:
db_timeout_t lease_timeout;
db_timespec lease_duration;
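As a rough illustration of the conversion step only, the sketch below turns a timeout into a seconds/nanoseconds pair. The names sk_timeout_t, sk_timespec and sk_timeout_to_timespec are hypothetical stand-ins, not the real db_timeout_t, db_timespec or DB_TIMEOUT_TO_TIMESPEC definitions. It assumes the timeout is expressed in microseconds and leaves the clock skew adjustment out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for db_timeout_t and db_timespec. */
typedef uint32_t sk_timeout_t;      /* Timeout in microseconds (assumed). */
typedef struct {
    int64_t tv_sec;                 /* Whole seconds. */
    long tv_nsec;                   /* Nanoseconds, 0..999999999. */
} sk_timespec;

/* Split a microsecond timeout into seconds and nanoseconds. */
static sk_timespec
sk_timeout_to_timespec(sk_timeout_t t)
{
    sk_timespec ts;

    ts.tv_sec = (int64_t)(t / 1000000);
    ts.tv_nsec = (long)(t % 1000000) * 1000;
    /* The remainder is below one second, so tv_nsec stays normalized. */
    assert(ts.tv_nsec < 1000000000L);
    return (ts);
}
```

Keeping the original value alongside the converted one, as the text describes, lets the duration be recomputed later without accumulating rounding error.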
NOTE: By sending the lease refresh during DB operations, we are
forcing/assuming that the operation's process has a replication
transport function set. That is obviously the case for write
operations, but would it be a burden for read processes (on a
master)? I think mostly not, but if we need leases for
DB->stat then we need to
document it as it is certainly possible for an application to have a
separate or dedicated stat
application or attempt to use db_stat
(which will not work if leases must be checked).
Leases should be checked after the local operation. If we were to
check leases first, we could get descheduled, then lose our lease,
and then perform the operation; checking afterward closes that window.
Do the operation, then check leases before returning to the user.
Using Leases
There is a new API that the user must call to tell the system to use
the lease mechanism. The method must be called before the
application calls dbenv->rep_start
or dbenv->repmgr_start.
This new
method is:
dbenv->rep_set_lease(DB_ENV *dbenv, u_int32_t clock_scale_factor, u_int32_t flags)
The clock_scale_factor
parameter is interpreted as a percentage, greater than 100 (to transmit
a floating point number as an integer to the API) that represents the
maximum skew between any two sites' clocks. That is, a clock_scale_factor of 150 suggests
that the greatest discrepancy between clocks is that one runs 50%
faster than the others. Both the
master and client sides
compensate for possible clock skew. The master uses the value to
compensate in case the replica has a slow clock and replicas compensate
in case they have a fast clock. This scaling factor will need to
be divided by 100 on all sites to truly represent the percentage for
adjustments made to time values.
Assume the slowest replica's clock is a factor of clock_scale_factor
slower than the
fastest clock. Using that assumption, if the fastest clock goes
from time t1 to t2 in X
seconds, the slowest clock does it in (clock_scale_factor / 100)
* X seconds.
The flags parameter is not
currently used.
When the dbenv->rep_set_lease
method is called, we will set a configuration flag indicating that
leases are turned on:
#define REP_C_LEASE <value>
We will also record the u_int32_t
clock_skew value passed in. The rep_set_lease method
will not allow
calls after rep_start. If
multiple calls are made prior to calling rep_start then later
calls will
overwrite the earlier clock skew value.
We need a new flag to prevent calling rep_set_lease
after rep_start. The
simplest solution would be to reject the call to
rep_set_lease
if
REP_F_CLIENT
or REP_F_MASTER is set.
However that does not work in the cases where a site cleanly closes its
environment and then opens without running recovery. The
replication state will still be set. The prevention will be
implemented as:
#define REP_F_START_CALLED <some bit value>
In __rep_start, at the end:
if (ret == 0) {
    REP_SYSTEM_LOCK(dbenv);
    F_SET(rep, REP_F_START_CALLED);
    REP_SYSTEM_UNLOCK(dbenv);
}
In __rep_env_refresh, if we
are the last reference closing the env (we already check for that):
F_CLR(rep, REP_F_START_CALLED);
In order to avoid run-time floating point operations
on db_timespec structures,
when a site is declared as a client or master in rep_start we
will pre-compute the
lease duration based on the integer-based clock skew and the
integer-based lease timeout. A master should set a replica's
lease expiration to the start time of
the sent message +
(lease_timeout / clock_scale_factor) in case the replica has a
slow clock. Replicas extend their leases to received message
time + (lease_timeout *
clock_scale_factor) in case this replica has a fast clock.
Therefore, the computation will be as follows if the site is becoming a
master:
db_timeout_t tmp;
tmp = (db_timeout_t)((double)rep->lease_timeout / ((double)rep->clock_skew / (double)100));
rep->lease_duration = DB_TIMEOUT_TO_TIMESPEC(&tmp);
Similarly, on a client the computation is:
tmp = (db_timeout_t)((double)rep->lease_timeout * ((double)rep->clock_skew / (double)100));
When a site changes state, its lease duration will change based on
whether it is becoming a master or client, and it will be recomputed
from the original values. Note that these computations, coupled
with the fact that the master computes its lease from the time at
which it sent the message, mean that leases on the master
are computed more conservatively than on the clients.
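To make the asymmetry concrete, here is a small sketch of the two computations. It deliberately swaps the double casts shown above for 64-bit integer arithmetic, which is a substitution for illustration and not the planned code. The type sk_timeout_t and both function names are hypothetical, and the timeout is assumed to be in microseconds.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for db_timeout_t (microseconds assumed). */
typedef uint32_t sk_timeout_t;

/*
 * Master side: shorten the lease in case the replica has a slow clock.
 * scale_pct plays the role of clock_scale_factor and must be nonzero.
 */
static sk_timeout_t
sk_master_duration(sk_timeout_t lease_timeout, uint32_t scale_pct)
{
    assert(scale_pct != 0);
    return ((sk_timeout_t)(((uint64_t)lease_timeout * 100) / scale_pct));
}

/*
 * Client side: lengthen the grant in case this replica has a fast clock.
 * The result is truncated to 32 bits, so very large timeouts would wrap.
 */
static sk_timeout_t
sk_client_duration(sk_timeout_t lease_timeout, uint32_t scale_pct)
{
    return ((sk_timeout_t)(((uint64_t)lease_timeout * scale_pct) / 100));
}
```

With a one second timeout (1000000) and a scale factor of 150, the master would treat the lease as valid for 666666 microseconds while the client would honor its grant for 1500000 microseconds, so the master always gives up authority before the client considers the grant expired.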
The dbenv->rep_set_lease
method must be called after dbenv->open,
similar to dbenv->rep_set_config.
The reason is so that we can check that this is a replication
environment and we have access to the replication shared memory region.
Read Operations
Authoritative read operations on the master with leases enabled will
abide by leases by default. We will provide a flag that allows an
operation on a master to ignore leases. All read operations
on a client imply
ignoring leases. If an application wants authoritative reads
they must forward the read requests to the master and it is the
application's responsibility to provide the forwarding.
The consensus was that forcing DB_IGNORE_LEASE
on client read operations (with leases enabled, obviously) was too
heavy handed. Read operations on the client will ignore leases,
but do no special flag checking.
The flag will be called DB_IGNORE_LEASE
and it will be a flag that can be OR'd into the DB access method and
cursor operation values. It will be similar to the DB_READ_UNCOMMITTED
flag.
The methods that will
adhere to leases are:
- DB->get
- DB->pget
- DBC->get
- DBC->pget
The code that will check leases for a client reading would look
something
like this, if we decide to become heavy-handed:
if (IS_REP_CLIENT(dbenv)) {
[get to rep structure]
if (FLD_ISSET(rep->config, REP_C_LEASE) && !LF_ISSET(DB_IGNORE_LEASE)) {
db_err("Read operations must ignore leases or go to master");
ret = EINVAL;
goto err;
}
}
On the master, the new code to abide by leases is more complex.
After the call to perform the operation we will check the lease.
In that checking code, the master will see if it has a valid
lease. If so, then all is well. If not, it will try to
refresh the leases. If that refresh attempt results in leases,
all is well. If the refresh attempt does not get leases, then the
master cannot respond to the read as an authority and we return an
error. The new error is called DB_REP_LEASE_EXPIRED.
The location of the master lease check is down after the internal call
to read the data is successful:
if (IS_REP_MASTER(dbenv) && !LF_ISSET(DB_IGNORE_LEASE)) {
[get to rep structure]
if (FLD_ISSET(rep->config, REP_C_LEASE) &&
(ret = __rep_lease_check(dbenv)) != 0) {
/*
* We don't hold the lease.
*/
goto err;
}
}
See below for the details of __rep_lease_check.
Also note that if leases (or replication) are not configured, then DB_IGNORE_LEASE is a no-op. It
is ignored (and won't error) if used when leases are not in
effect. The reason is so that we can generically set that flag in
utility programs like db_dump
that walk the database with a cursor. Note that db_dump is the only utility that
reads with a cursor.
Nsites and Elections
The call to dbenv->rep_set_nsites
must be performed before the call to dbenv->rep_start
or dbenv->repmgr_start.
This document assumes either that SR
14778 gets resolved, or assumes that the value of nsites is
immutable. The
master and all clients need to know how many sites and leases are in
the group. Clients need to know for elections. The master
needs to know for the size of the lease table and to know what value a
majority of the group is. [Until
14778 is resolved, the master lease work must assume nsites is
immutable and will
therefore enforce that this is called before rep_start using
the same mechanism
as rep_set_lease.]
Elections and leases need to agree on the number of sites in the
group. Therefore, when leases are in effect on clients, all calls
to dbenv->rep_elect must
set the nsites parameter to
0. The rep_elect code
path will return EINVAL if REP_C_LEASE is set and nsites
is non-0.
Lease Management
Message Changes
In order for clients to grant leases to the master a new message type
must be added for that purpose. This will be the REP_LEASE_GRANT
message.
Granting leases will be a result of applying a DB_REP_PERMANENT
record and therefore we
do not need any additional message in order for a master to request a
lease grant. The REP_LEASE_GRANT
message will pass a structure as its message DBT:
typedef struct __rep_lease_grant {
    db_timespec msg_time;       /* Master's time-of-msg ID, echoed back. */
#ifdef DIAGNOSTIC
    db_timespec expire_time;    /* Client's expiration, for diagnostics. */
#endif
} REP_GRANT_INFO;
In the REP_LEASE_GRANT
message, the client is actually giving the master several pieces of
information. We only need the echoed msg_time in this
structure because
everything else is already sent. The client is really sending the
master:
- Its EID (parameter to rep_send_message
and rep_process_message)
- The PERM LSN this message acknowledged (sent in the control
message)
- Unique identifier echoed back to master (msg_time sent in
message as above)
On the client, we always maintain the maximum PERM LSN already in lp->max_perm_lsn.
Local State Management
Each client must maintain a db_timespec
timestamp containing the expiration of its granted lease. This
field will be in the replication shared memory structure:
db_timespec grant_expire;
This timestamp already takes into account the clock skew. All
new fields must be initialized when the region is created. Whenever we
grant our master lease and want to send the REP_LEASE_GRANT
message, this value
will be updated. It will be used in the following way:
db_timespec mytime;
DB_LSN perm_lsn;
DBT lease_dbt;
REP_GRANT_INFO gi;

timespecclear(&mytime);
memset(&lease_dbt, 0, sizeof(lease_dbt));
memset(&gi, 0, sizeof(gi));
/* Candidate expiration: now plus the skew-adjusted lease duration. */
__os_gettime(dbenv, &mytime);
timespecadd(&mytime, &rep->lease_duration);
/* Read the maximum PERM LSN under the client database mutex. */
MUTEX_LOCK(rep->clientdb_mutex);
perm_lsn = lp->max_perm_lsn;
MUTEX_UNLOCK(rep->clientdb_mutex);
REP_SYSTEM_LOCK(dbenv);
/* Only ever extend the grant; never shorten it. */
if (timespeccmp(&mytime, &rep->grant_expire, >))
    rep->grant_expire = mytime;
/* Echo the master's time-of-msg ID back to it. */
gi.msg_time = msg->msg_time;
#ifdef DIAGNOSTIC
gi.expire_time = rep->grant_expire;
#endif
lease_dbt.data = &gi;
lease_dbt.size = sizeof(gi);
REP_SYSTEM_UNLOCK(dbenv);
__rep_send_message(dbenv, eid, REP_LEASE_GRANT, &perm_lsn, &lease_dbt, 0, 0);
This updating of the lease grant will occur in the PERM code
path when we have
successfully applied the permanent record.
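The important property of the client-side update is that a grant only ever moves forward in time. A self-contained sketch of that rule, using the standard struct timespec in place of db_timespec, might look like the following. The sk_* helpers are hypothetical names, and locking and message sending are left out.

```c
#include <assert.h>
#include <time.h>

#define SK_NSEC_PER_SEC 1000000000L

/* Return a + b, keeping tv_nsec normalized below one second. */
static struct timespec
sk_ts_add(struct timespec a, struct timespec b)
{
    struct timespec r;

    r.tv_sec = a.tv_sec + b.tv_sec;
    r.tv_nsec = a.tv_nsec + b.tv_nsec;
    if (r.tv_nsec >= SK_NSEC_PER_SEC) {
        r.tv_sec++;
        r.tv_nsec -= SK_NSEC_PER_SEC;
    }
    return (r);
}

/* Return nonzero if a is strictly later than b. */
static int
sk_ts_after(struct timespec a, struct timespec b)
{
    if (a.tv_sec != b.tv_sec)
        return (a.tv_sec > b.tv_sec);
    return (a.tv_nsec > b.tv_nsec);
}

/* Extend the grant to now + duration unless it already reaches further. */
static void
sk_extend_grant(struct timespec *grant_expire, struct timespec now,
    struct timespec duration)
{
    struct timespec candidate;

    assert(grant_expire != NULL);
    candidate = sk_ts_add(now, duration);
    if (sk_ts_after(candidate, *grant_expire))
        *grant_expire = candidate;
}
```

Taking the maximum means that a delayed or reordered PERM record cannot shorten a grant the client has already promised to the master.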
Maintaining Leases on the Master/Rep_start
The master maintains a lease table that it checks when fulfilling a
read request that is subject to leases. This table is initialized
when a site calls
dbenv->rep_start(DB_MASTER) and the site is undergoing a role
change (i.e. a master making additional calls to dbenv->rep_start(DB_MASTER)
does
not affect an already existing table).
When a non-master site becomes master, it must do two things related to
leases on a role change. First, a client cannot upgrade to master
while it has an outstanding lease granted to another site. If a
client attempts to do so, an error, EINVAL,
will be returned. The only way this should happen is if the
application simply declares a site master, instead of using
elections. Elections will already wait for leases to expire
before proceeding. (See below.)
Second, once we are proceeding with becoming a master, the site must
allocate the table it will use to maintain lease information.
This table will be sized based on nsites
and it will be an array of the following structure:
typedef struct {
    int eid;                    /* EID of client site. */
    db_timespec start_time;     /* Unique time ID client echoes back on grants. */
    db_timespec end_time;       /* Master's lease expiration time. */
    DB_LSN lease_lsn;           /* Durable LSN this lease applies to. */
    u_int32_t flags;            /* Unused for now?? */
} REP_LEASE_ENTRY;
Granting Leases
It is the burden of the application to make sure that all sites in the
group
are using leases, or none are. Therefore, when a client processes
a PERM
log record that arrived from the master, it will grant its lease
automatically if that record is permanent (i.e. DB_REP_ISPERM
is being returned),
and leases are configured. A client will not send a
lease grant when it is processing log records (even PERM
ones) it receives from other clients that use client-to-client
synchronization. The reason is that the master requires a unique
time-of-msg ID (see below) that the client echoes back in its lease
grant and it will not have such an ID from another client.
The master stores a time-of-msg ID in each message and the client
simply echoes it back to the master. In its lease table, the master
keeps the base
time-of-msg for a valid lease. When a REP_LEASE_GRANT
message comes in,
the master does a number of things:
- Pulls the echoed timespec from the client message, into msg_time.
- Finds the entry in its lease table for the client's EID. It
walks the table searching for the ID. EIDs of DB_EID_INVALID are
illegal. Either the master will find the entry, or it will find
an empty slot in the table (i.e. it is still populating the table with
leases).
- If this is a previously unknown site lease, the master
initializes the entry by copying to the eid, start_time, and
lease_lsn fields. The master
also computes the end_time
based on the adjusted rep->lease_duration.
- If this is a lease from a previously known site, the master must
perform timespeccmp(&msg_time,
&table[i].start_time, >) and only update the end_time
of the lease when this is
a more recent message. If it is a more recent message, then we
should update
the lease_lsn to the LSN in
the message.
- Lease durations are computed taking the clock skew into
account: clients compute them based on the current time and the master
computes them based on the original sending time. For diagnostic purposes
only, I also plan to send the client's expiration time. The
client errs on the side of computing a larger lease expiration time and
the master errs on the side of computing a smaller duration.
Since both are taking the clock skew
into account, the client's expiration time should never be
earlier than
the master's computed expiration time; if it is, their value for clock
skew may not be correct.
Any log records (new or resent) that originate from the master and
result in DB_REP_ISPERM get an
ack.
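The table update described in the steps above can be sketched as follows. This is a simplified model under stated assumptions: timestamps are plain microsecond counts instead of db_timespec, SK_EID_INVALID is a hypothetical marker for an unused slot, and every sk_* name is invented for illustration. It follows the rule in the list (update only when the echoed time is more recent) and does no locking.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SK_EID_INVALID (-2)     /* Hypothetical marker for an unused slot. */

struct sk_lsn { uint32_t file, offset; };

struct sk_lease_entry {
    int eid;                    /* Client site, or SK_EID_INVALID. */
    int64_t start_time;         /* Echoed time-of-msg ID (microseconds). */
    int64_t end_time;           /* start_time plus the master's duration. */
    struct sk_lsn lease_lsn;    /* PERM LSN the grant acknowledged. */
};

/*
 * Apply one grant.  Returns 0 if it was handled (a stale grant is
 * handled by ignoring it) and -1 for an illegal EID or a full table.
 */
static int
sk_lease_grant(struct sk_lease_entry *table, size_t nentries, int eid,
    int64_t msg_time, struct sk_lsn lsn, int64_t lease_duration)
{
    struct sk_lease_entry *slot;
    size_t i;

    assert(table != NULL);
    if (eid == SK_EID_INVALID)
        return (-1);
    /* Prefer the site's existing entry; else remember the first free slot. */
    slot = NULL;
    for (i = 0; i < nentries; i++) {
        if (table[i].eid == eid) {
            slot = &table[i];
            break;
        }
        if (table[i].eid == SK_EID_INVALID && slot == NULL)
            slot = &table[i];
    }
    if (slot == NULL)
        return (-1);
    /* A known site only refreshes its lease with a more recent message. */
    if (slot->eid == eid && msg_time <= slot->start_time)
        return (0);
    slot->eid = eid;
    slot->start_time = msg_time;
    slot->end_time = msg_time + lease_duration;
    slot->lease_lsn = lsn;
    return (0);
}
```

Because end_time is derived from the echoed send time and not from the arrival time, a grant that was delayed in transit yields a correspondingly shorter remaining lease.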
Refreshing Leases
Leases get refreshed when a master receives a REP_LEASE_GRANT
message from a client. There are three pieces to lease
refreshment.
Lazy Lease Refreshing on Read
If the master discovers that leases are
expired during the read operation, it attempts to refresh its
collection of lease grants. It does this by calling a new
function __rep_lease_refresh.
This function is very similar to the already-existing function __rep_flush.
Basically, to
refresh the lease, the master simply needs to resend the last PERM
record to the clients. The requirements state that when the
application send function returns successfully from sending a PERM
record, the majority of clients have that PERM LSN durable. We
will have a new public DB error return called DB_REP_LEASE_EXPIRED
that will be
returned back to the caller if the master cannot assert its
authority. The code will look something like this:
/*
 * Use lp->max_perm_lsn on the master (currently not used on the master)
 * to keep track of the last PERM record written through the logging system.
 * We need to initialize lp->max_perm_lsn in rep_start on role_chg.
 */
call __rep_send_message on the last PERM record the master wrote, with DB_REP_PERMANENT
if failure
    expire leases
    return lease expired error to caller
else /* success */
    recheck lease table
    /*
     * We need to recheck the lease table because the client
     * lease grant messages may not be processed yet, or got
     * lost, or racing with the application's ACK messages or
     * whatever.
     */
    if we have a majority of valid leases
        return success
    else
        return lease expired error to caller
Ongoing Update Refreshment
Second is having the master indicate to
the client it needs to send a lease grant in response to the current
PERM log message. The problem is
that acknowledgements must contain a master-supplied message timestamp
that the client sends back to the master. We need to modify the
structure of the log record messages when leases are configured
so
that when a PERM message is sent, the master sends, and the client
expects, the message timestamp. There are three fairly
straightforward and different implementations to consider.
- Adding the timestamp to the REP_CONTROL
structure. If this option is chosen, then the code trivially
sends back the timestamp in the client's reply. There is no
special processing done by either side with the message contents.
So, on a PERM log record, the master will send a non-zero
timestamp. On a normal log record the timestamp will be zero or
some known invalid value. If the client sees a non-zero
timestamp, it sends a REP_LEASE_GRANT
with the lp->max_perm_lsn
after applying that log record. If it is zero, then the client
does nothing different. The advantage is ease of code. The
disadvantage is that for mixed version systems, the client is now
dealing with different sized control structures. We would have to
retain the old control structure so that during a mixed version group
the (upgraded) clients can use, expect and send old control structures
to the master. This is unfortunate, so let's consider additional
implementations that don't require modifying the control structure.
- Adding a new REPCTL_LEASE
flag to the list of flags for the control structure, but do not change
the control structure fields. When a master wants to send a
message that needs a lease ack, it sets the flag. Additionally,
instead of simply sending a log record DBT as the rec parameter
for replication, we
would send a new structure that had the timestamp first and then the
record (similar to the bulk transfer buffer). The advantage of
this is that the control structure does not change. Disadvantages
include more special-cased code in the normal code path where we have
to check the flag. If the flag is set we have to extract the
timestamp value and massage the incoming data to pass on the real log
record to rep_apply. On
bulk transfer, we would just add the timestamp into the buffer.
On normal transfers, it would incur an additional data copy on the
master side. That is unfortunate. Additionally, if this
record needs to be stored in the temp db, we need some way to get it
back again later or rep_apply
would have to extract the timestamp out when it processed the record
(either live or from the temp db).
- Adding a different message type, such as REP_LOG_ACK.
Similarly to REP_LOG_MORE this message would be a
special-case version of a log record. We would extract out the
timestamp and then handle as a normal log record. This
implementation is rejected because it actually would require three new
message types: REP_LOG_ACK,
REP_LOG_ACK_MORE, REP_BULK_LOG_ACK. That is just too ugly
to contemplate.
[Slight digression: it occurs
to me while writing about #2 and #3 above, that our implementation of
all of the *_MORE messages could really be implemented with a REPCTL_MORE
flag instead of a
separate message type. We should clean that up and simplify the
messages but not part of master leases. Hmm, taking that thought
process further, we really could get rid of the REP_BULK_*
messages as well if we
added a REPCTL_BULK
flag. I think we should definitely do it for the *_MORE
messages. I am not sure we should do it for bulk because the
structure of the incoming data record is vastly different.]
Of these options, I believe that modifying the control structure is the
best alternative. The handling of the old structure will be very
isolated to code dealing with old versions and is far less complicated
than injecting the timestamp into the log record DBT and doing a data
copy. Actually, I will likely combine #1 and the flag from #2
above. I will have the REPCTL_LEASE
flag that indicates a lease grant reply is expected and have the
timestamp in the control structure.
Also I will probably add in a spare field or two for future use in the REP_CONTROL
structure.
Gap processing
No matter which implementation we choose for ongoing lease refreshment,
gap processing must be considered. The code above assumes the
timestamps will be placed on PERM records only. Normal log
records will not have a timestamp, nor a flag or anything else like
that. However, any log message can fill a gap on a client and
result in the processing of that normal log record to return DB_REP_ISPERM
because later records
were also processed.
The current implementation should work fine in that case because when
we store the message in the client temp db we store both the control
DBT and the record DBT. Therefore, when a normal record fills a
gap, the later PERM record, when retrieved will look just like it did
when it arrived. The client will have access to the LSN, and the
timestamp, etc. However, it does mean that sending the REP_LEASE_GRANT
message must take
place down in __rep_apply
because that is the only place we have access to the contents of those
stored records with the timestamps.
There are two logical choices to consider for granting the lease when
processing an update. As we process (either a live record or one
read from the temp db after filling a gap) a PERM message, we send the REP_LEASE_GRANT
message for each
PERM record we successfully apply. Or, second, we keep track of
the largest timestamp of all PERM records we've processed and at the
end of the function after we've applied all records, we send back a
single lease grant with the max_perm_lsn
and a new max_lease_timestamp
value to the master. The first is easier to implement, the second
results in possibly slightly fewer messages at the expense of more
bookkeeping on the client.
A third, more complicated option would be to have the message timestamp
on all records, but grants are only sent on the PERM messages. A
reason to do this is that the later timestamp of a normal log record
would be used as the timestamp sent in the reply and the master would
get a more up to date timestamp value and a longer lease.
If we change the REP_CONTROL
structure to include the timestamp, we potentially break or at least
need to revisit the gap processing algorithm. That code assumes
that the control and record elements for the same LSN look the same
each and every time. The code stores the control DBT as the key and the rec DBT as the data. We use a
specialized compare function to sort based on the LSN in the control
DBT. With master leases, the same record transmitted by a master
multiple times or client for the same LSN will be different because the
timestamp field will not be the same. Therefore, the client will
end up with duplicate entries in the temp database for the same
LSN. Both solutions (adding the timestamp to REP_CONTROL and adding a REPCTL_LEASE flag) can yield
duplicate entries. The flag would cause the same record from the
master and client to be different as well.
Handling Incoming Lease Grants
The third piece of lease management is handling the incoming REP_LEASE_GRANT
message on the
master. When this message is received, the master must do the
following:
REP_SYSTEM_LOCK
msg_timestamp = cntrl->timestamp;
client_lease = __rep_lease_entry(dbenv, client eid)
if (client_lease == NULL) {
    initial lease for this site, DB_ASSERT there is space in the table
    add this to the table if there is space
} else {
    compare msg_timestamp with client_lease->start_time
    if (msg_timestamp is more recent && msg_lsn >= lease LSN)
        update entry in table
}
REP_SYSTEM_UNLOCK
Expiring Leases
Leases can expire in two ways. First they can expire naturally
due to the passage of time. When checking leases, if the current
time is later than the lease entry's end_time
then the lease is expired. Second, they can be forced with a
premature expiration when the application's transport function returns
an error. In the first case, there is nothing to do, in the
second case we need to manipulate the end_time
so that all future lease checks fail. Since the lease start_time
is guaranteed to not be in the future we will have a function __rep_lease_expire
that will:
REP_SYSTEM_LOCK
for each entry in the lease table
entry->end_time = entry->start_time;
REP_SYSTEM_UNLOCK
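Both kinds of expiration can be seen in a small model. In the sketch below the timestamps are plain microsecond counts instead of db_timespec, and struct sk_lease and the two functions are hypothetical names. It assumes that a lease counts as valid while the current time has not passed end_time, and it omits the region lock.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct sk_lease {
    int64_t start_time;         /* Time-of-msg ID; never in the future. */
    int64_t end_time;           /* Expiration on the master's clock. */
};

/* Premature expiration: pull every end_time back to its start_time. */
static void
sk_lease_expire(struct sk_lease *table, size_t nentries)
{
    size_t i;

    assert(table != NULL || nentries == 0);
    for (i = 0; i < nentries; i++)
        table[i].end_time = table[i].start_time;
}

/* Natural expiration: count leases whose end_time has not yet passed. */
static size_t
sk_lease_count_valid(const struct sk_lease *table, size_t nentries,
    int64_t now)
{
    size_t i, valid;

    assert(table != NULL || nentries == 0);
    for (i = 0, valid = 0; i < nentries; i++)
        if (table[i].end_time >= now)
            valid++;
    return (valid);
}
```

Since start_time is never in the future, setting end_time to start_time makes any later check see the entry as expired without needing a separate "expired" flag.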
Is there a potential race or problem with prematurely expiring
leases? Consider an application that enforces an ALL
acknowledgement policy for PERM records in its transport
callback. There are four clients and three send the PERM ack to
the application. The callback returns an error to the master DB
code. The DB code will now prematurely expire its leases.
However, at approximately the same time the three clients are also
sending their REP_LEASE_GRANT
messages to the master. There is a race between the master
processing those messages and the thread handling the callback failure
expiring the table. This is only an issue if the messages arrive
after the table has been expired.
Let's assume all three clients send their grants after the master
expires the table. If we accept those grants and then a read
occurs, the read will succeed since the master has a majority of leases,
even though the callback failed earlier. Is that a problem?
The lease code is using a majority and the application policy is using
some other value. It feels like this should be okay since
the data is held by leases on a majority. Should we consider
having the lease checking threshold be the same as the permanent ack
policy? That is difficult because Base API users implement
whatever they want and DB does not know what it is.
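To make the race concrete, here is a small sketch of the scenario: a master with four clients, an ALL policy that fails, a forced expiration, and then three grants that arrive late. The names, the five-site constant and the timestamps are hypothetical, and the majority test assumes nsites counts the master.

```c
#include <assert.h>
#include <stdint.h>

#define NSITES   5		/* master plus four clients */
#define NCLIENTS (NSITES - 1)

typedef struct { int64_t start_time, end_time; } lease_t;

/* Premature expiration after the send callback fails. */
void expire_all(lease_t *tab)
{
	for (int i = 0; i < NCLIENTS; i++)
		tab[i].end_time = tab[i].start_time;
}

/* Accept a grant only if it is newer than what the table holds. */
void grant(lease_t *tab, int client, int64_t ts, int64_t timeout)
{
	assert(client >= 0 && client < NCLIENTS);
	if (ts > tab[client].start_time) {
		tab[client].start_time = ts;
		tab[client].end_time = ts + timeout;
	}
}

/* The master needs NSITES / 2 live client leases for a majority. */
int majority_held(const lease_t *tab, int64_t now)
{
	int live = 0;

	for (int i = 0; i < NCLIENTS; i++)
		if (tab[i].end_time >= now)
			live++;
	return (live >= NSITES / 2);
}
```

If the table is expired first and the three grants are applied afterwards, majority_held reports success again, which is exactly the outcome discussed above: the read is allowed on the strength of a majority even though the ALL policy was not met.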
Checking Leases
When a read operation on the master completes, the last thing we need
to do is verify the master leases. We've already discussed above
how they are refreshed when they are expired. We need two things
for a lease to be valid: it must be within the timeframe of the
lease grant, and it must be valid for the last PERM record
LSN. Here is the logic
for checking the validity of leases in __rep_lease_check:
#define MAX_REFRESH_TRIES 3
    DB_LSN lease_lsn;
    REP_LEASE_ENTRY *entry;
    u_int32_t min_leases, valid_leases;
    db_timespec cur_time;
    int ret, tries;

    tries = 0;
retry:
    ret = 0;
    LOG_SYSTEM_LOCK
    lease_lsn = lp->lsn;
    LOG_SYSTEM_UNLOCK
    REP_SYSTEM_LOCK
    min_leases = rep->nsites / 2;
    __os_gettime(dbenv, &cur_time);
    for (entry = head of table, valid_leases = 0;
        entry != NULL && valid_leases < min_leases; entry++)
        if (timespec_cmp(&entry->end_time, &cur_time) >= 0 &&
            log_compare(&entry->lsn, &lease_lsn) == 0)
            valid_leases++;
    REP_SYSTEM_UNLOCK
    if (valid_leases < min_leases) {
        ret = __rep_lease_refresh(dbenv, ...);
        /*
         * If we are successful, we need to recheck the leases because
         * the lease grant messages may have raced with the PERM
         * acknowledgement. Give those messages a chance to arrive.
         */
        if (ret == 0) {
            if (tries <= MAX_REFRESH_TRIES) {
                /*
                 * If we were successful sending, but not successful in racing the
                 * message thread, yield the processor so that message
                 * threads may have a chance to run.
                 */
                if (tries > 0)
                    /* __os_sleep instead?? */
                    __os_yield();
                tries++;
                goto retry;
            } else
                ret = DB_RET_LEASE_EXPIRED;
        }
    }
    return (ret);
If the master has enough valid leases it returns success. If it
does not have enough, it attempts to refresh them. This attempt
may fail if sending the PERM record does not receive sufficient
acks. If we do receive sufficient acknowledgements we may still
find that scheduling of message threads means the master hasn't yet
processed the incoming REP_LEASE_GRANT
messages. We will retry a few times (up to MAX_REFRESH_TRIES, possibly
parameterized) if the master discovers that situation.
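The counting step of the check can be shown on its own. The sketch below is an assumption-laden illustration: lease_t, lsn_t and leases_valid are hypothetical names, timestamps are plain microsecond counts, and the refresh and retry logic is left out. It keeps the two conditions described above, a grant that has not timed out and a grant LSN equal to the LSN being checked.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint32_t file, offset; } lsn_t;

typedef struct {
	int64_t end_time;	/* when this client's grant runs out (usec) */
	lsn_t   lsn;		/* LSN this client's grant covers */
} lease_t;

int lsn_equal(const lsn_t *a, const lsn_t *b)
{
	return (a->file == b->file && a->offset == b->offset);
}

/*
 * Return 1 if enough client leases are valid at time now for perm_lsn.
 * nsites is assumed to count the master, so nsites / 2 client grants
 * plus the master itself make up a majority of the group.
 */
int leases_valid(const lease_t *tab, int nclients, int nsites,
    int64_t now, lsn_t perm_lsn)
{
	int need, have = 0;

	assert(nsites > 0 && nclients >= 0);
	need = nsites / 2;
	for (int i = 0; i < nclients && have < need; i++)
		if (tab[i].end_time >= now && lsn_equal(&tab[i].lsn, &perm_lsn))
			have++;
	return (have >= need);
}
```

With five sites the master needs two valid client leases. A grant that is still within its time window but was issued for an older LSN does not count, which is why a read right after a commit can require a refresh.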
Elections
When a client grants a lease to a master, it gives up the right to
participate in an election until that grant expires. If we are
the master and dbenv->rep_elect
is called, it should return, no matter what, like it does today.
If we are a client and rep_elect
is called, special processing takes place when leases are in
effect. In the easy case, the lease granted by this
client has already expired and the client goes directly into the
election as normal. If a valid lease grant is outstanding to a
master, this site cannot participate in an election until that grant
expires. We have at least two options when a site calls the dbenv->rep_elect
API while
leases are in effect.
- The simplest coding solution for DB would be simply to refuse to
participate in the election if this site has a current lease granted to
a master. We would detect this situation and return EINVAL.
This is correct behavior and trivial to implement. The
disadvantage of this solution is that the application would then be
responsible for repeatedly attempting an election until the lease grant
expired.
- The more satisfying solution is for DB to wait the remaining time
for the grant. If this client hears from the master during that
time the election does not take place and the call to rep_elect
returns with the
information for the current/old master.
Election Code Changes
The code changes to support leases in the election code are fairly
isolated. First if leases are configured, we must verify the nsites
parameter is set to 0.
Second, in __rep_elect_init
we must not overwrite the value of rep->nsites
for leases because it is controlled by the dbenv->rep_set_nsites
API.
These changes are small and easy to understand.
The more complicated code will be the client code when it has an
outstanding lease granted. The client will wait for the current
lease grant to expire before proceeding with the election. The
client will only do so if it does not hear from the master for the
remainder of the lease grant time. If the client hears from the
master, it returns and does not begin participating in the
election. A new election phase, REP_EPHASE0
will exist so that the call to __rep_wait
can detect if a master responds. The client, while waiting for
the lease grant to expire, will send a REP_MASTER_REQ
message so that the master will respond with a REP_NEWMASTER
message and thus
allow the client to know the master exists. However, it is also
desirable that when the master
replies to the client, the client also updates its lease
grant.
Recall that the REP_NEWMASTER
message does not result in a lease grant from the client. The
client responds with its lease grant, up to the given LSN, when it
processes a PERM record that has the REPCTL_LEASE
flag set in the message. Therefore, we want the
client's REP_MASTER_REQ to
both discover the existing master and cause the master to
refresh its leases. The client will also use the REPCTL_LEASE
flag in its REP_MASTER_REQ message to the
master. This flag will serve as the indicator to the master that
it needs to deal with leases and both send the REP_NEWMASTER
message and refresh
the lease.
The code will work as follows:
if (leases_configured && (my_grant_still_valid || lease_never_granted)) {
    if (lease_never_granted)
        wait_time = lease_timeout
    else
        wait_time = grant_expiration - current_time
    F_SET(REP_F_EPHASE0);
    __rep_send_message(..., REP_MASTER_REQ, ... REPCTL_LEASE);
    ret = __rep_wait(..., REP_F_EPHASE0);
    if (we found a master)
        return
} /* if we don't return, fall out and proceed with election */
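The wait time computation in the client code above is simple enough to isolate. This is a sketch with a hypothetical function name and microsecond units, not the election code itself. A return of 0 means the client goes straight into the election.

```c
#include <assert.h>
#include <stdint.h>

/*
 * How long a client should wait to hear from a master before it may
 * enter an election. All times are in microseconds.
 */
int64_t election_wait_usec(int leases_configured, int lease_never_granted,
    int64_t grant_expiration, int64_t now, int64_t lease_timeout)
{
	assert(lease_timeout > 0);
	if (!leases_configured)
		return (0);
	if (lease_never_granted)
		return (lease_timeout);		/* wait one full lease period */
	if (grant_expiration > now)
		return (grant_expiration - now);	/* remainder of the grant */
	return (0);				/* grant already expired */
}
```

For example, a client whose grant expires at time 5000 and that calls the election at time 1000 waits 4000, while a client that has never granted a lease waits the full configured timeout.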
On the master side, the code handling the REP_MASTER_REQ will
do:
if (I am master) {
    ...
    __rep_send_message(REP_NEWMASTER...)
    if (F_ISSET(rp, REPCTL_LEASE))
        __rep_lease_refresh(...)
}
Other minor implementation details are that __rep_elect_done
must also clear
the REP_F_EPHASE0 flag.
We also, obviously, need to define REP_F_EPHASE0
in the list of replication flags. Note that the client's call to __rep_wait
will return upon
receiving the REP_NEWMASTER
message. The client will independently refresh its lease when it
receives the log record from the master's call to refresh the lease.
Again, similar to what I suggested above, the code could simply assume
global leases are configured, and instead of having the REPCTL_LEASE
flag at all, the master
assumes that it needs to refresh leases because it has them configured,
not because it is specified in the REP_MASTER_REQ
message it is processing. Right now I don't think every possible
REP_MASTER_REQ message should result in a lease grant request.
Elections and Quiescent Systems
It is possible that a master is slow or the client is close to its
expiration time, or that the master is quiescent and all leases are
currently expired, but nothing much is going on anyway, yet some client
calls __rep_elect at that
time. In the code above, we will not send the REP_MASTER_REQ
because the lease is
not valid. The client will simply proceed directly to sending the
REP_VOTE1 message, throwing all
other clients into an election. The master is still master and
should stay that way. Currently in response to a vote message, a
master will broadcast out a REP_NEWMASTER
to assert its mastership. That causes the election to
complete. However, the master may also want to proactively
refresh its leases. This situation indicates to me that the
master should choose to refresh leases based on configuration, not a
flag sent from the client. I believe that any time the master asserts
its mastership by sending a REP_NEWMASTER
message, it should also proactively refresh its leases at that
time.
Other Implementation Details
Role Changes
When a site changes its role via a call to rep_start in either
direction, we
must take action when leases are configured. There are three
types of role changes that all need changes to deal with leases:
- A master downgrading to a client. When a master downgrades to a
client, it can do so
immediately after it has proactively expired all existing leases it
holds. This situation is similar to an error from the send
callback, and it effectively cancels all outstanding leases held on
this site. Note that if this master expires its leases, it does
not have any effect on when the clients' lease grants expire on the
client side. The clients must still wait their full expected
grant time.
- A client upgrading to master. If a client is upgrading to a master
but it has an outstanding lease
granted to another site, the code will return an EINVAL
error. This situation
only arises if the application simply declares this site master.
If a site wins an election then the election itself should have waited
long enough for the granted lease to expire and this state should not
arise.
- A client finding a new master. When a client discovers a new and
different master via a REP_NEWMASTER
message, the
client cannot accept that new master until its current lease grant
expires. This situation should only occur when a site declares
itself master without an election and that site's lease grant expires
before this client's grant expires. However, it is possible
for this situation to arise
with elections also. If we have 5 sites holding an election and 4
of those sites have leases expire at about the same time T, and this
site's lease expires at time T+N and the election timeout is < N,
then those 4 sites may hold an election and elect a master without this
site's participation. A client in this situation must call __rep_wait
with the time remaining
on its lease. If the lease is expired after waiting the remaining
time, then the client can accept this new master. If the lease
was refreshed during the waiting period then the client does not accept
this new master and returns.
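The client-side decisions in the last two cases reduce to comparisons against the expiration of the client's own grant. The sketch below uses hypothetical function names and microsecond times, and it treats a grant as outstanding while its expiration is strictly later than the current time, which matches the wait computation used for elections.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Application declares this client master: refuse while a grant is out. */
int client_upgrade_check(int64_t grant_end, int64_t now)
{
	return (grant_end > now ? EINVAL : 0);
}

/* Client saw a different master: time remaining on its current grant. */
int64_t newmaster_wait(int64_t grant_end, int64_t now)
{
	return (grant_end > now ? grant_end - now : 0);
}

/*
 * After waiting, accept the new master only if the old grant really ran
 * out. If the old master refreshed the lease while we waited, the grant
 * end moved forward and the new master is not accepted.
 */
int accept_new_master(int64_t grant_end_before, int64_t grant_end_after,
    int64_t now_after_wait)
{
	/* A refresh can only extend a grant, never shorten it. */
	assert(grant_end_after >= grant_end_before);
	return (grant_end_after <= now_after_wait);
}
```

Taking the T and T+N example with T = 1000 and N = 300: a client that sees the new master at time 1100 waits 200. If nothing refreshes its lease it accepts the new master at 1300. If the old master refreshed the lease to 1800 in the meantime, it does not.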
DUPMASTER
A duplicate master situation can occur if an old master becomes
disconnected from the rest of the group, that group elects a new master
and then the partition is resolved. The requirement for master
leases is that this situation will not cause the newly elected,
rightful master to receive the DB_REP_DUPMASTER
return. It is okay for the old master to get that return
value. When a dual master situation exists, the following will
happen:
- On the current master and all current clients - If the current
master receives an update
message or other conflicting message from the old master then that
message will be ignored because the generation number is out of date.
- On the old master - If
the old master receives an update message from the current master, or
any other message with a later generation from any site, the new
generation number will trigger this site to return DB_REP_DUPMASTER.
However,
instead of broadcasting out the REP_DUPMASTER
message to shoot down others as well, this site, if leases are
configured, will call __rep_lease_check
and, if they are expired, return the error. It should be
impossible for us to receive a later generation message and still hold
a majority of master leases. If that ever happened, something would be
seriously wrong, so we
will DB_ASSERT that this situation
cannot happen.
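The generation comparison described in the two cases above can be sketched as a small decision function. Everything here is hypothetical (the enum, the function name, the flag arguments), and a plain assert stands in for DB_ASSERT. Messages with an equal generation are outside what this section describes, so the sketch just hands them on to normal processing.

```c
#include <assert.h>
#include <stdint.h>

typedef enum { MSG_PROCESS, MSG_IGNORE, MSG_DUPMASTER } msg_action_t;

/*
 * Decide what a site that believes it is master does with a message
 * from another master, based only on generation numbers.
 */
msg_action_t master_check_gen(uint32_t my_gen, uint32_t msg_gen,
    int leases_configured, int majority_leases_valid)
{
	if (msg_gen < my_gen)
		return (MSG_IGNORE);	/* stale message from an old master */
	if (msg_gen > my_gen) {
		/*
		 * We are the old master. Holding a majority of leases
		 * while a later generation exists should be impossible.
		 */
		if (leases_configured)
			assert(!majority_leases_valid);
		return (MSG_DUPMASTER);
	}
	return (MSG_PROCESS);
}
```

Only the site with the older generation ever returns MSG_DUPMASTER here. The current master simply drops the old master's messages, which is the behavior the lease requirement asks for.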
Client to Client Synchronization
One question to ask is how lease grants interact with client-to-client
synchronization. The answer is that they do not. A client
that is sending log records to another client cannot request that the
receiving client refresh its lease with the master. The sending client
does not have a timestamp it can use on the master's behalf, and clock skew
makes its own timestamp meaningless between machines. Therefore, sites that use
client-to-client synchronization will likely see more lease refreshment
during the read path and leases will be refreshed during live updates
only. Of course, if a client supplies log records that fill a
gap, and the later log records stored came from the master in a live
update then the client will respond as per the discussion on Gap
Processing above.
Interaction Matrix
If leases are granted (by a client) or held (by a master) what should
the following APIs and messages do?
Other:
log_archive: Leases do not affect log_archive. OK.
dbenv->close: OK.
crash during lease grant and restart: Potential
problem here. See discussion below.
Rep Base API method:
rep_elect: Already discussed above. Must wait for lease to expire.
rep_flush: Master only, OK - this will be the basis for refreshing
leases.
rep_get_*: Not affected by leases.
rep_process_message: Generally OK. We'll discuss each message
below.
rep_set_config: OK.
rep_set_limit: OK.
rep_set_nsites: Must be called before rep_start
and nsites is immutable until
14778 is resolved.
rep_set_priority: OK.
rep_set_timeout: OK. Used to set lease timeout.
rep_set_transport: OK.
rep_start(MASTER): Role changes are discussed above. Make sure
duplicate rep_start calls are no-ops for leases.
rep_start(CLIENT): Role changes are discussed above. Make sure
duplicate calls are no-ops for leases.
rep_stat: OK.
rep_sync: Should not be able to happen. Client cannot accept new
master with outstanding lease grant. Add DB_ASSERT here.
REP_ALIVE: OK.
REP_ALIVE_REQ: OK.
REP_ALL_REQ: OK.
REP_BULK_LOG: OK. Clients check to send ACK.
REP_BULK_PAGE: Should never process one with lease granted. Add
DB_ASSERT.
REP_DUPMASTER: Should never happen, this is what leases are supposed to
prevent. See above.
REP_LOG: OK. Clients check to send ACK.
REP_LOG_MORE: OK. Clients check to send ACK.
REP_LOG_REQ: OK.
REP_MASTER_REQ: OK.
REP_NEWCLIENT: OK.
REP_NEWFILE: OK. Clients check to send ACK.
REP_NEWMASTER: See above.
REP_NEWSITE: OK.
REP_PAGE: OK. Should never process one with lease granted.
Add DB_ASSERT.
REP_PAGE_FAIL: OK. Should never process one with lease
granted. Add DB_ASSERT.
REP_PAGE_MORE: OK. Should never process one with lease
granted. Add DB_ASSERT.
REP_PAGE_REQ: OK.
REP_REREQUEST: OK.
REP_UPDATE: OK. Should never process one with lease
granted. Add DB_ASSERT.
REP_UPDATE_REQ: OK. This is a master-only message.
REP_VERIFY: OK. Should never process one with lease
granted. Add DB_ASSERT.
REP_VERIFY_FAIL: OK. Should never process one with lease
granted. Add DB_ASSERT.
REP_VERIFY_REQ: OK.
REP_VOTE1: OK. See Election discussion above. It is
possible to receive one with a lease granted. Client cannot send
one with an outstanding lease however.
REP_VOTE2: OK. See Election discussion above. It is
possible to receive one with a lease granted.
If the following method or message processing is in progress and a
client wants to grant a lease, what should it do? Let's examine
what this means. The client wanting to grant a lease simply means
it is responding to the receipt of a REP_LOG
(or its variants) message and applying a log record. Therefore,
we need to consider a thread processing a log message racing with these
other actions.
Other:
log_archive: OK.
dbenv->close: User error. User should not be closing the env
while other threads are using that handle. Should have no effect
if a 2nd dbenv handle to same env is closed.
Rep Base API method:
rep_elect: See Election discussion above. rep_elect
should wait and may grant
lease while election is in progress.
rep_flush: Should not be called on client.
rep_get_*: OK.
rep_process_message: Generally OK. See handling each message
below.
rep_set_config: OK.
rep_set_limit: OK.
rep_set_nsites: Must be called before rep_start
until 14778 is resolved.
rep_set_priority: OK.
rep_set_timeout: OK.
rep_set_transport: OK.
rep_start(MASTER): OK, can't happen - already protect racing rep_start
and rep_process_message.
rep_start(CLIENT): OK, can't happen - already protect racing rep_start
and rep_process_message.
rep_stat: OK.
rep_sync: Shouldn't happen because client cannot grant leases during
sync-up. Incoming log message ignored.
REP_ALIVE: OK.
REP_ALIVE_REQ: OK.
REP_ALL_REQ: OK.
REP_BULK_LOG: OK.
REP_BULK_PAGE: OK. Incoming log message ignored during internal
init.
REP_DUPMASTER: Shouldn't happen. See DUPMASTER discussion above.
REP_LOG: OK.
REP_LOG_MORE: OK.
REP_LOG_REQ: OK.
REP_MASTER_REQ: OK.
REP_NEWCLIENT: OK.
REP_NEWFILE: OK.
REP_NEWMASTER: See above. If a client accepts a new master
because its lease grant expired, then that master sends a message
requesting the lease grant, this client will not process the log record
if it is in sync-up recovery, or it may after the master switch is
complete and the client doesn't need sync-up recovery. Basically,
just uses existing log record processing/newmaster infrastructure.
REP_NEWSITE: OK.
REP_PAGE: OK. Receiving a log record during internal init PAGE
phase should ignore log record.
REP_PAGE_FAIL: OK.
REP_PAGE_MORE: OK.
REP_PAGE_REQ: OK.
REP_REREQUEST: OK.
REP_UPDATE: OK. Receiving a log record during internal init
should ignore log record.
REP_UPDATE_REQ: OK - master-only message.
REP_VERIFY: OK. Receiving a log record during verify phase
ignores log record.
REP_VERIFY_FAIL: OK.
REP_VERIFY_REQ: OK.
REP_VOTE1: OK. This client is processing someone else's vote when
the lease request comes in. That is fine. We protect our
own election and lease interaction in __rep_elect.
REP_VOTE2: OK.
Crashing - Potential Problem
It appears there is one area where we could have a problem. I
believe that crashes can cause us to break our guarantees on durability,
authoritative reads, and the inability to elect duplicate masters.
Consider this scenario:
1. A master and 4 clients are all up and running.
2. The master commits a txn and all 4 clients refresh their lease
grants at time T.
3. All 4 clients have the txn and log records in the cache.
None are flushing to disk.
4. All 4 clients have responded to the PERM messages as well as
refreshed their lease with the master.
5. All 4 clients hit the same application coding error and crash
(machine/OS stays up).
6. Master authoritatively reads data in txn from step 2.
7. All 4 clients restart the application and run recovery, thus the
txn from step 2 is lost on all clients because it isn't in any logs.
8. A network partition happens and the master is alone on its side.
9. All 4 clients are on the other side and elect a new master.
10. Partition resolves itself and we have duplicate masters, where
the former master still holds all valid lease grants.
Therefore, we have broken both guarantees. In step 6 the data is
really not durable and we've given it to the user. One can argue
that if this is an issue the application better be syncing somewhere if
they really want durability. However, worse than that is that we
have a legitimate DUPMASTER situation in step 10 where both masters
hold valid leases. The reason is that all lease knowledge is in
the shared memory and that is lost when the app restarts and runs
recovery.
How can we solve this? The obvious solution is (ugh, yet another)
durable BDB-owned file with some information in it, such as the current
lease expiration time so that rebooting after a crash leaves the
knowledge that the lease was granted. However, writing and
syncing every lease grant on every client out to disk is far too
expensive.
A second possible solution is to have clients wait a full lease timeout
before entering an election the first time. This solution solves the
DUPMASTER issue, but not the non-authoritative read. This
solution naturally falls out of elections and leases really. If a
client has never granted a lease, it should be considered as having to
wait a full lease timeout before entering an election.
Applications already know that leases impact elections and this does
not seem so bad as it is only on the first election.
Is it sufficient to document that the authoritative read is only as
authoritative as the durability guarantees the application makes on the
sites that
indicate it is permanent? Yes, I believe this is sufficient. If
the application says it is permanent and it really isn't, then the
application is at fault. Believing the application when it
indicates with the PERM response that it is permanent avoids the
authoritative read problem.
Upgrade/Mixed Versions
Clearly leases cannot be used with mixed version sites since masters
running older releases will not have any knowledge of lease
support. What considerations are needed in the lease code for
mixed versions?
First, if the REP_CONTROL
structure changes, we need to maintain and use an old version of the
structure for talking to older clients and masters. The
implementation of this would be similar to the way we manage old REP_VOTE_INFO
structures.
Second, any new messages need translation table entries added.
Third, if we are assuming global leases then clearly any mixed versions
cannot have leases configured, and leases cannot be used in mixed
version groups. Maintaining two versions of the control structure
is not necessary if we choose a different style of implementation and
don't change the control structure.
However, then how could an old application both run continuously,
upgrade to the new release and take advantage of leases without taking
down the entire application? I believe it is possible for clients
to be configured for leases but be subject to the master regarding
leases, yet the master code can assume that if it has leases
configured, all client sites do as well. In several places above
I suggested that a client could make a choice based on either a new REPCTL_LEASE
flag or simply having
leases turned on locally. If we choose to use the flag, then we
can support leases with mixed versions. The upgraded clients can
configure leases, but no lease will be granted until the old
master is upgraded and sends a PERM message with the flag indicating it
wants a lease grant. Until then the clients, while having leases
configured, will simply have an expired
lease. Then, when the old master finally upgrades, it too can
configure leases and suddenly all sites are using them. I believe
this should work just fine and I will need to make sure a client's
granting of leases is only in response to the master asking for a
grant. If the master never asks, then the client has them
configured, but doesn't grant them.
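The rule that a client grants a lease only when the master asks for one can be sketched as a single predicate. The flag names and bit values below are hypothetical stand-ins for the PERM indication and the proposed REPCTL_LEASE flag, and the function name is made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define CTL_PERM  0x1u	/* hypothetical bit: record is a PERM record */
#define CTL_LEASE 0x2u	/* hypothetical bit: master wants a lease grant */

/*
 * Return 1 if the client should send a lease grant for this message.
 * An old master never sets CTL_LEASE, so an upgraded client with leases
 * configured stays silent until the master is upgraded too.
 */
int client_should_grant(int leases_configured, uint32_t ctl_flags)
{
	/* This sketch only knows about the two flags defined above. */
	assert((ctl_flags & ~(CTL_PERM | CTL_LEASE)) == 0);
	return (leases_configured &&
	    (ctl_flags & CTL_PERM) != 0 &&
	    (ctl_flags & CTL_LEASE) != 0);
}
```

Keying the grant on the flag rather than on local configuration is what lets a group upgrade one site at a time: the predicate is false for every message from an old master, and becomes true as soon as the upgraded master starts setting the flag.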
Testing
Clearly any user-facing API changes will need the equivalent reflection
in the Tcl API for testing, under CONFIG_TEST.
I am sure the list of tests will grow but off the top of my head:
Basic test: have N sites all configure leases, run some, read on
master, etc.
Refresh test: Perform update on master, sleep until past expiration,
read on master and make sure leases are refreshed/read successful.
Error test: Test error conditions (reading on client with leases but no
ignore flag, calling after rep_start, etc).
Read test: Test reading on both client and master both with and without
the IGNORE flag. Test that data read with the ignore flag can be
rolled back.
Dupmaster test: Force a DUPMASTER situation and verify that the newer
master cannot get a DUPMASTER error.
Election test: Call election while grant is outstanding and master
exists.
Call election while grant is outstanding and master does not exist.
Call election after expiration on quiescent system with master
existing.
Run with a group where some members have leases configured and others do
not to make sure we get errors instead of dumping core.