---
sidebar_position: 4
---

# State Sync

State sync synchronizes the state of a lagging replica with the healthy cluster.

State sync is used when when a lagging replica's log no longer intersects with the cluster's current log —
[WAL repair](./vsr.md#protocol-repair-wal) cannot catch the replica up.

(VRR refers to state sync as "state transfer", but we already have [transfers](../reference/transfers.md) elsewhere.)

In the context of state sync, "state" refers to:
1. the superblock `vsr_state.checkpoint`
2. the grid (manifest, free set, and client sessions blocks)
3. the grid (LSM table data; acquired blocks only)
4. client replies

State sync consists of four protocols:
- [Sync Superblock](./vsr.md#protocol-sync-superblock) (syncs 1)
- [Repair Grid](./vsr.md#protocol-repair-grid) (syncs 2)
- [Sync Forest](./vsr.md#protocol-sync-forest) (syncs 3)
- [Sync Client Replies](./vsr.md#protocol-sync-client-replies) (syncs 4)

The target of superblock-sync is the latest checkpoint of the healthy cluster.
When we catch up to the latest checkpoint (or very close to it), then we can transition back to a healthy state.

## Glossary

Replica roles:

- _syncing replica_: A replica performing superblock-sync. (Any step within *1*-*10* of the [sync algorithm](#algorithm))
- _healthy replica_: A replica _not_ performing superblock-sync — part of the active cluster.
- _divergent replica_: A replica with a checkpoint that is (and can never be) canonical.

Checkpoints:

- [_checkpoint id_/_checkpoint identifier_](#checkpoint-identifier): Uniquely identifies a particular checkpoint reproducibly across replicas.
- [_sync target_](#sync-target): The checkpoint identifier of the target of superblock-sync.

## Algorithm

0. [Sync is needed](#0-scenarios).
1. [Trigger sync](#1-triggers).
2. Wait for non-grid commit operation to finish.
3. Wait for grid IO to finish. (See `Grid.cancel()`.)
4. Wait for a usable sync target to arrive. (Usually we already have one.)
5. Begin [sync-superblock protocol](./vsr.md#protocol-sync-superblock).
6. [Request superblock checkpoint state](#6-request-superblock-checkpoint-state).
7. Update the superblock headers with:
    - Bump `vsr_state.checkpoint.header` to the sync target header.
    - Bump `vsr_state.checkpoint.parent_checkpoint_id` to the checkpoint id that is previous to our sync target (i.e. it isn't _our_ previous checkpoint).
    - Bump `replica.commit_min`. (If `replica.commit_min` exceeds `replica.op`, transition to `status=recovering_head`).
    - Set `vsr_state.sync_op_min` to the minimum op which has not been repaired.
    - Set `vsr_state.sync_op_max` to the maximum op which has not been repaired.
8. Sync-superblock protocol is done.
9. Repair [replies](./vsr.md#protocol-sync-client-replies), [free set, client sessions, and manifest blocks](./vsr.md#protocol-repair-grid), and [table blocks](./vsr.md#protocol-sync-forest) that were created within the `sync_op_{min,max}` range.
10. Update the superblock with:
    - Set `vsr_state.sync_op_min = 0`
    - Set `vsr_state.sync_op_max = 0`

If a newer sync target is discovered during steps *5*-*6* or *9*, go to step *4*.

If the replica starts up with `vsr_state.sync_op_max ≠ 0`, go to step *9*.

### 0: Scenarios

Scenarios requiring state sync:

1. A replica was down/partitioned/slow for a while and the rest of the cluster moved on.
  The lagging replica is too far behind to catch up via WAL repair.
2. A replica was just formatted and is being added to the cluster (i.e. via [reconfiguration](./vsr.md#protocol-reconfiguration)).
  The new replica is too far behind to catch up via WAL repair.

Causes of number 3:
- A storage determinism bug.
- An upgraded replica (e.g. a canary) running a different version of the code from the remainder of the cluster, which unexpectedly changes its history.
  (The change either has a bug or should have been gated behind a feature flag.)

### 1: Triggers

State sync is initially triggered by any of the following:

- The replica receives a SV which indicates that it has lagged so far behind the cluster that its log cannot possibly intersect.
- `repair_sync_timeout` fires, and:
    - a WAL or grid repair is in progress and,
    - the replica's checkpoint is lagging behind the cluster's (far enough that the repair may never complete).

### 6: Request Superblock Checkpoint State

The syncing replica sends `command=request_sync_checkpoint` messages (with the sync target identifier attached to each) until it receives a `command=sync_checkpoint` with a matching checkpoint identifier.

## Concepts
### Syncing Replica

Syncing replicas may:

- [write prepares to their WAL](#syncing-replicas-write-prepares-to-their-wal)
- assist with grid repair
- join new views
- send a `do_view_change`

Syncing replicas must not:

- [ack](#syncing-replicas-dont-ack-prepares)
- commit prepares
- be a primary

#### Syncing Replicas write prepares to their WAL.

When the replica completes superblock-sync, an up-to-date WAL and journal allow it to quickly catch up (i.e. commit) to the current cluster state.

#### Syncing Replicas don't ack prepares.

If syncing replicas _did_ ack prepares:

Consider a cluster of 3 replicas:
- the _primary_,
- a _normal backup_, and
- a _syncing backup_.

1. _Primary_ prepares many ops...
2. _Syncing backup_ prepares and acknowledges all of those messages.
3. _Normal backup_ is partitioned — its not seeing any of these prepares.
4. _Primary_ is receiving `prepare_ok`s from the _syncing backup_, so it is committing.
5. _Primary_ eventually checkpoints.
6. (This cycle repeats — _primary_ keeps preparing/committing, _syncing backup_ keeps preparing, and _normal backup_ is still partitioned.)

But now _primary_ is so far ahead that the _normal backup_ needs to sync!
Having 2/3 replicas syncing means that a single grid-block corruption on the primary could make the cluster permanently unavailable.

### Checkpoint Identifier

A _checkpoint id_ is a hash of the superblock `CheckpointState`.

A checkpoint identifier is attached to the following message types:
- `command=commit`: Current checkpoint identifier of sender.
- `command=ping`: Current checkpoint identifier of sender.
- `command=prepare`: The attached checkpoint id is the checkpoint id during which the corresponding prepare was originally prepared.
- `command=prepare_ok`: The attached checkpoint id is the checkpoint id during which the corresponding prepare was originally prepared.
- `command=request_sync_checkpoint`: Requested checkpoint identifier.
- `command=sync_checkpoint`: Current checkpoint identifier of sender.

### Sync Target

A _sync target_ is the [checkpoint identifier](#checkpoint-identifier) of the checkpoint that the superblock-sync is syncing towards.

Not all checkpoint identifiers are valid sync targets.

Every sync target **must**:
- have an op greater than the syncing replica's current checkpoint op.
- either:
  - be committed atop – i.e. the syncing replica can sync to `healthy_replica.checkpoint_op` when `trigger_for_checkpoint(healthy_replica.checkpoint_op) < healthy_replica.commit_min` – ensuring that the checkpoint has been reached by a quorum of replicas, or
  - be more than 1 checkpoint ahead of our current checkpoint.

### Storage Determinism

When everything works, storage is deterministic.
If non-determinism is detected (via checkpoint id mismatches) the replica which detects the mismatch will panic.
This scenario should prompt operator investigation and manual intervention.