# SlateDB Timestamps & Time-To-Live - [Motivation](#motivation) - [Goals](#goals) - [Out of Scope](#out-of-scope) - [Public API](#public-api) - [Implementation](#implementation) * [Data Format](#data-format) * [Compaction](#compaction) - [Rejected Alternatives](#rejected-alternatives--follow-ups) - [Updates](#updates) Status: Accepted Authors: * [Almog Gavra](https://github.com/agavra) References: * https://github.com/slatedb/slatedb/issues/218 * https://github.com/slatedb/slatedb/issues/225 * https://github.com/facebook/rocksdb/wiki/User-defined-Timestamp * https://github.com/facebook/rocksdb/wiki/Time-to-Live * https://github.com/facebook/rocksdb/wiki/Compaction-Filter * https://dl.acm.org/doi/pdf/10.1145/3483840 * https://disc-projects.bu.edu/lethe/ * https://dl.acm.org/doi/10.1145/3555041.3589719 ## Motivation SlateDB does not have a native understanding of time in data operations (it does in regard to compaction/garbage collection, discussion on this later). This introduces additional complexity to operators that want to implement semantics that depend on the insertion/modification time of rows. The most common usages of timestamps are: 1. **Time-To-Live**: many applications, either for compliance or space-saving purposes require purging data a certain amount of time after its insertion. Use cases with time series data and privacy requirements such as GDPR come to mind here, though ensuring that deleted data is collected within a time bound is left for a future exercise. 2. **Resolving Out-of-Order Operations (out of scope)**: some applications may process incoming requests in a manner that results in out-of-order insertions to SlateDB (stream processing applications are frequently subject to this type of pattern). Using insertion timestamps to resolve which version of a key is more recent may be preferable to using the clock at the time of insertion. 3. **Historical Reads (out of scope)**: while SlateDB provides snapshot capabilities to ensure reads are consistent and repeatable, it requires a snapshot to have been taken at the desired time. Increasing the number of snapshots adds significant storage overhead. Instead, it is possible to trade read-latency for space amplification by maintaining multiple versions of keys at different timestamps to allow for similar "historical" read operations. ## Goals - Define the API for specifying timestamps via user defined clocks - Define the changes to the on-disk format of stored rows - Outline mechanism to support Time-To-Live (TTL) via Compaction Filters ## Out of Scope While the following are out of scope for the initial implementation, the design should not preclude: - Handling out-of-order insertions is not in scope for this RFC, though we make sure the design accounts for the future possibility (see rejected alternatives section) - Outlining compaction strategies that are optimized for TTL (such as Cassandra's [TWCS](https://cassandra.apache.org/doc/stable/cassandra/operating/compaction/twcs.html)). This will require additional metadata per-SST and a compaction strategy to match, but is otherwise possible with what's outlined in the scope of this design. - Outlining compaction strategies that target efficient deletes (such as outlined in [Lethe](https://disc-projects.bu.edu/lethe/)) - Ensuring deletions are resolved within a time bound (a TTL does not ensure the data is deleted from the object store within the specified time) - A public API for compaction filters (see https://github.com/slatedb/slatedb/issues/225) - this RFC will implement TTL using a compaction filter but will not yet expose a public API for plugging in custom compaction filters. Instead, the implementation should ensure that such an extension is easy. ## Public API We will make the following changes to `DbOptions`: - introduce a configuration for a custom clock, and will default to using the system clock - introduce a configuration for a default TTL, which defaults to `None` Note that the clock is expected to return monotonically increasing (greater than or equal to previous) values. This will be enforced by wrapping the supplied `Clock` implementation that tracks the previous values returned. ```rust /// defines the clock that SlateDB will use during this session pub trait Clock { /// Returns a timestamp in milliseconds since the unix epoch, /// must return monotonically increasing numbers (this is enforced /// at runtime and will panic if the invariant is broken) /// /// Note that this clock does not need to return a number that /// represents the unix timestamp; the only requirement is that /// it represents a sequence that can attribute a logical ordering /// to actions on the database fn now(&self) -> i64; } /// contains the default implementation of the Clock, and will return the system time pub struct SystemClock; // ... pub struct DbOptions { // ... /// The Clock to use for insertion timestamps /// /// Default: the default clock uses the local system time on the machine clock: Option>, /// The default time-to-live (TTL) for insertions (note that re-inserting a key /// with any value will update the TTL to use the default_ttl) /// /// Default: no TTL (insertions will remain until deleted) default_ttl: Option } ``` To enforce that the clock remains monotonically increasing across instantiations of the database, we will add the maximum timestamp written in both the manifest and each SST ( see the change to `SsTableInfo` below). When seeding the database minimum timestamp, SlateDB will use the maximum value seen in all persisted WAL SSTs and the latest manifest. ```fbs table ManifestV1 { // ... timestamp: long; } ``` `Db#put_put_with_options` can optionally specify a row-level TTL at insertion time by leveraging newly added field in `WriteOptions`: ```rust pub struct WriteOptions { /// The time-to-live (ttl) for this insertion. If this insert overwrites an existing /// database entry, the TTL for the most recent entry will be canonical. /// /// Default: the TTL configured in DbOptions when opening a SlateDB session pub ttl: Option, } ``` For this RFC we will not introduce the ability to disable filtering out TTL-expired records on retrieval. The default implementation will read the metadata for every retrieved row and filter out records where `ts + tll < now`. We can, as a future optimization, enable lazy deserialization of the metadata so that this doesn't need to happen on each row retrieved if the user requests the ability to retrieve TTL'd rows that have not yet been compacted. This can be part of the API design for compaction filters (see https://github.com/slatedb/slatedb/issues/225) ## Implementation ### Data Format Currently, SlateDB does not have a mechanism for backwards-compatibility of the SST data format. We will first introduce the following fields to the flatbuffer definitions: - `format_version`: field that specifies the codec version that encoded this SST - `row_features`: an enumeration of metadata fields that are available with each encoded row. This makes it possible in the future to save space per row by disabling row attributes, but that is out of scope for this RFC - `creation_timestamp`: the time at which this SST file was written. For this RFC this timestamp is used to schedule follow-up compactions (see [periodic compaction](#periodic-compaction)). - `min_ts`/`max_ts`: metadata that indicates the earliest and latest timestamp values in the SST ```fbs /// the metadata encoded with each row, note that the ordering of this enum is /// sensitive and must be maintained enum RowFeature: byte { RowFlags, Timestamp, TimeToLive, } table SsTableInfo { // ... // The current encoding version of the data block format_version: uint; // The metadata attributes that are encoded with each row. These attributes will be encoded // in order that they are declared in the RowAttributes enum row_features: [RowFeature]; // The time at which this SST was created, based on the configured Clock in // DbOptions creation_timestamp: long; // The minimum timestamp of any row in the SST, based on the configured Clock in // DbOptions min_ts: long; // The maximum timestamp of any row in the SST, based on the configured Clock in // DbOptions max_ts: long; } ``` This value will be written when the SST is created. We will read this back using a codec that maps 1:1 with the format version. The version 0 has the following per-value format: ``` | u16 | var | u32 | var | |---------|-----|-----------|-------| | key_len | key | value_len | value | |---------|-----|-----------|-------| ``` The `format_version` will be bumped to 1 and will be modified to have the following schema: ``` | u16 | var | var | u32 | var | |---------|-----|------|-----------|-------| | key_len | key | attr | value_len | value | |---------|-----|------|-----------|-------| ``` The newly introduced `attr` field will be decoded using the `row_features` array specified for this SST. Which row attributes are included depend on the data that is being written. If any row in the SST has a `ttl`, the `TimeToLive` feature will be enabled. As part of this proposal, we will no longer represent tombstones as `INT_MAX` value_len and instead have an indicator `byte` in the attribute array which specifies the `RowFlags`. `RowFlags` is a bitmask of descriptors on the row. The first bit in `RowFlags` is set to `1` if the row is a tombstone. Other bits are reserved for future flags. Later we can add other types of row attributes as necessary (since flatbuffers does not support single-bit values, a byte is the smallest amount of data we can store here): ``` |-- attr --| | u16 | var | byte | u32 | var | |---------|-----|----------|-----------|-------| | key_len | key | row_flag | value_len | value | |---------|-----|----------|-----------|-------| ``` In cases where both `Timestamp` and `TimeToLive` are enabled, the row will look like: ``` |-------- attr --------------| | u16 | var | byte | i64 | i64 | u32 | var | |---------|-----|----------------------------|-----------|-------| | key_len | key | row_flag | ts | expire_ts | value_len | value | |---------|-----|----------|-----|-----------|-----------|-------| ``` Both `ts` and `expire_ts` are encoded as `i64` in milliseconds. The metadata will be returned in the decoded `KeyValueDeletable` which will be modified: ```rust pub struct KeyValueDeletable { pub key: Bytes, pub value: ValueDeletable, pub attributes: RowAttributes, } pub struct RowAttributes { pub ts: Option, pub expire_ts: Option, } ``` ### Compaction #### TTL Compaction Filter We will introduce the concept of a compaction filter, which serves as a mechanism for pluggable participation in the compaction process. Specifically, the filter (if configured) will run on each key during the compaction process and optionally remove or modify the value. This is independent of the periodic compaction interval specified in the following section. For the scope of this RFC, we will be introducing a builtin TS/TTL compaction filter. This filter will add a tombstone to any value where `db.clock.now() > kv.attributes.expire_ts`. Note that tombstones will still contain a timestamp see [out of order insertions](#insertion-timestamps). If a tombstone is added, the TTL (`expire_ts`) for the row will be cleared, indicating that a future run of this filter should not remove the tombstone. Since this RFC does not cover the public API for compaction filters, this filter will simply run on any SST that specifies `row_features: [Timestamp && TimeToLive]`. #### Periodic Compaction If the SST is written with the `TimeToLive` row attribute enabled, the compaction scheduler will compare the SST's `SsTableInfo#timestamp` with a newly introduced compaction config for periodic compaction time limits: ```rust use std::time::Duration; pub struct CompactorOptions { // ... // When there are SSTs in L0 (or Sorted Runs with SSTs in other levels) older than // `periodic_compaction_interval`, SlateDB will include them in the next scheduled // compaction -- this is a best effort interval and slateDB does not guarantee that // compaction will happen within this interval (implementations of the compaction scheduler // are free to implement jitters or other mechanisms to ensure that SSTs are not // all compacted simultaneously) periodic_compaction_interval: Duration } ``` This will ensure that Sorted Runs or SSTs that are not otherwise scheduled by the compaction scheduler will be picked up and TTL / tombstones will be compacted. This is particularly relevant for the lower levels of the LSM tree. ## Rejected Alternatives & Follow-Ups ### Insertion Timestamps A previous version of this proposal introduced the concept of row level timestamps -- the ability to set a timestamp at each insertion instead of only specifying a custom clock. This introduced the ability to have out-of-order insertions, which in turn caused significant downstream complications. In order to leave the door open for timestamps on insertion, we encode the clock time with each row in the SST as opposed to only maintaining the time that it should expire. This comes at the cost of an extra 4 bytes when cached in memory (compression will probably mean it is stored as fewer bytes). Another design consideration in the RFC that leaves the door open for insertion timestamps is the association of a timestamp with tombstones. This is to handle the scenario where a record with a TTL is inserted with a lower timestamp than an existing record. Consider the following: ``` seq0: Insert K1 with TS=100 and TTL=100000 seq1: Insert K1 with TS=50 and TTL=100 ``` If `seq0` and `seq1` result in writes in different levels of the LSM tree, when a compaction kicks in at TS=170 a tombstone is generated for `K1`. This tombstone should not result in a deletion of the original `seq0` insert as it logically happened "after" `seq1`. ## Updates ### September 25, 2024 - Removed out-of-order insertions - Removed the functionality to retrieve metadata of a row. Without multiple copies of the row the insertion time is less interesting. we can consider adding this functionality back in upon request. - Added the requirement for the clock to be monotonically increasing - Added a timestamp field to `ManifestV1` to seed clock - Added a creation_timestamp field to `SsTableInfo` and `periodic_compaction_seconds` to the `CompactorOptions` - Added a `RowType` metadata feature and used it to replace the current encoding of tombstones ### October 1, 2024 - Changed `RowType` to `RowFlags` and indicated that the first bit is used to identify tombstones - Added `min_ts`/`max_ts` into `SsTableInfo` - Updated the seeding of the clock min timestamp to look at the WAL SSTs as well as the manifest - Modified the stored value in the SST from `ttl` to `expire_ts` - Marked `periodic_compaction_interval` as "best effort" to allow for implementations to optimize performance characteristics such as adding a jitter or max number of SSTs to compact in one go - Many misc. improvements to readability and clarifications ### October 6, 2024 - Minor changes and feedback. Most notable is renaming "meta features" to "row attributes"