# Storage Basics This article explains the important types/traits at the storage layer. ## Types and traits ### Array & ArrayBuilder Data is stored in columns in RisingLight. Here a column is referred to as an "array". `Array` is a trait over all arrays, specifyng the interface for all arrays. `PrimitiveArray` implements the `Array` trait, where `T` can be `bool`, `i32`, etc. `ArrayImpl` is an enum over different arrays. `ArrayBuilder` is a trait over array builders that builds `Array`. Similar to how `PrimitiveArray` implements `Array`, `PrimitiveArrayBuilder` implements `ArrayBuilder`. `ArrayBuilderImpl` is an enum over different array builders. Some fields/methods are shown below: ```rust pub trait Array: Sized + Send + Sync + 'static { /// Corresponding builder of this array. type Builder: ArrayBuilder; /// Type of element in the array. type Item: ToOwned + ?Sized; /// Retrieve a reference to value. fn get(&self, idx: usize) -> Option<&Self::Item>; /// Get iterator of current array. fn iter(&self) -> ArrayIter<'_, Self> { ArrayIter::new(self) } ... } pub trait ArrayBuilder: Sized + Send + Sync + 'static { /// Corresponding `Array` of this builder type Array: Array; /// Append a value to builder. fn push(&mut self, value: Option<&::Item>); /// Take all elements and return a new array. fn take(&mut self) -> Self::Array; /// Finish build and return a new array. fn finish(mut self) -> Self::Array { self.take() } ... } ``` ### DataChunk & DataChunkBuilder `DataChunk` is a list of arrays, representing a contiguous set of rows. `DataChunkBuilder` uses a list of array builders to build data chunks. ```rust pub struct DataChunk { arrays: Arc<[ArrayImpl]>, ... } pub struct DataChunkBuilder { array_builders: Vec, ... } impl DataChunkBuidler { // Add a row (a list of DataValue's) to the data chunk fn push_row(&mut self, row: impl IntoIterator) -> Option; // Finishes building data chunk fn take(&mut self) -> Option; } ``` ### MemTable Each transaction has an in-memory table as a write buffer. The `MemTable` trait specifies that the mem-table should take `DataChunk`s and outputs a `DataChunk`. ```rust pub trait MemTable { /// add data to memory table fn append(&mut self, columns: DataChunk) -> StorageResult<()>; /// flush data to [`DataChunk`] fn flush(self) -> StorageResult; } pub struct BTreeMapMemTable { columns: Arc<[ColumnCatalog]>, primary_key_idx: usize, multi_btree_map: BTreeMultiMap, // } ``` `BTreeMapMemTable` implements `MemTable`. At `append`, it adds each row in the data chunk to its internal B-tree map. At `flush`, it creates array builders using type information from `ColumnCatalog`, builds arrays from rows using those builders, and constructs a `DataChunk` from those arrays. ### SecondaryMemRowSet `SecondaryMemRowsetImpl` represents a mem-table that flushes to disk. Those flushed data will be in `EncodedRowset` format. ```rust pub struct SecondaryMemRowset { mem_table: M, rowset_id: u32, rowset_builder: RowsetBuilder, } impl SecondaryMemRowset { pub fn append(&mut self, columns: DataChunk) -> StorageReulst<()> { self.mem_table.append(columns) } pub fn flush(self, ..) -> StorageResult<()> { let chunk = self.mem_table.flush(); self.rowset_builder.append(chunk); let encoded_rowset = self.rowset_builder.finish(); let writer = RowsetWriter::new(); writer.flush(encoded_rowset); } } pub enum SecondaryMemRowsetImpl { BTree(SecondaryMemRowset), Column(SecondaryMemRowset), } ``` ### Columns & Blocks An `EncodedColumn` is a column in serialized format, containing both index and data of the column in binary. Column builders, defined by the `ColumnBuilder` trait, return `(Vec, Vec)` when finishing a column. Note the returned index (`Vec`) is not in binary format yet and `IndexBuilder` serializes `Vec` to `Vec` later. ```rust pub struct EncodedColumn { index: Vec, data: Vec } pub trait ColumnBuilder { /// Append an [`Array`] to the column. [`ColumnBuilder`] will chunk it into small parts. fn append(&mut self, array: &A); /// Finish a column, return (index, data) for the block fn finish(self) -> (Vec, Vec); } ``` `PrimitiveColumnBuilder` implements `ColumnBuilder` trait. Internally, `PrimitiveColumnBuilder` rechunks data into `Block`s using `BlockBuilder`s. A `Block` is the minimum operation unit of secondary storage. Details about blocks are omitted here. ```rust pub struct PrimitiveColumnBuilder { data: Vec, /// Current block builder current_builder: Option>, /// Block index builder block_index_builder: BlockIndexBuilder, ... } ``` `ColumnBuilderImpl` is an enum of different `PrimitiveColumnBuilder`s. ```rust impl ColumnBuilderImpl { pub fn new_from_datatype(datatype: &DataType, options: ColumnBuilderOptions) -> Self; pub fn append(&mut self, array: &ArrayImpl); pub fn finish(self) -> (Vec, Vec); } ``` ### EncodedRowset & RowsetBuilder An `EncodedRowset` is a collection of `EncodedColumns`. `RowsetBuilder` builds it using column builders. ```rust pub struct EncodedRowset { columns_info: Arc<[ColumnCatalog]>, columns: Vec, ... } pub struct RowsetBuilder { columns: Arc<[ColumnCatalog]>, builders: Vec, ... } impl RowsetBuilder { fn append(&mut self, chunk: DataChunk) { for idx in 0..chunk.column_count() { self.builders[idx].append(chunk.array_at(idx)); } } pub fn finish(self) -> EncodedRowset { EncodedRowset { columns_info: self.columns.clone(), columns: self .builders .into_iter() .map(|builder| { let (block_indices, data) = builder.finish(); let mut index_builder = IndexBuilder::new(...); for index in block_indices { index_builder.append(index); } let index = index_builder.finish(); EncodedColumn { index, data } }) .collect_vec(), } } } ``` ### Transaction: Flush to Secondary Storage A `SecondaryTransaction` represents a transaction that persists data to disk. A transaction buffers the writes (e.g. insertions) in the mem-table and only flushes to disk when (1) the mem-table exceeds a certain size threshold, or (2) the transaction commits. ```rust pub struct SecondaryTransaction { mem: Option, ... } impl SecondaryTransaction { fn flush_rowset(&mut self) { self.mem.flush(); Add flushed rowset to self.to_be_committed_rowsets; } pub fn append(&mut self, columns: DataChunk) { if self.mem.is_none() { self.mem = ... } self.mem.append(columns); if memtable_size_exceeds_threshold() { self.flush_rowset(); } } pub fn commit(mut self) -> StorageResult<()> { self.flush_rowset(); // Skipped: Flush DVs, Commit changeset to manifests, etc. } } ``` ### Summary - `Array` represents a column in memory. - `DataChunk` represents a contiguous set of rows, implemented as a collection of `Array`'s. - `MemTable` is an in-memory write buffer. `append` takes `DataChunk`. `flush` outputs `DataChunk`. - `SecondaryMemRowset` is a mem-table that flushes to secondary storage. `append` appends to its `MemTable`. `flush` flushes the `DataChunk` returned by `MemTable` to disk as `EncodedRowset`. - `EncodedColumn` represents a column to be persisted to the secondary storage. It contains the index and data of a column. `ColumnBuilder` takes array at `append`, and outputs `EncodedColumn` at `flush`, roughly. - `EncodedRowset` is a collection of `EncodedColumn`s. `RowsetBuilder` uses column builders internally. `RowsetBuilder` takes `DataChunk` at `append`, and outputs `EncodedRowset` at `flush`. ![types and traits](images/06-storage-basics-01.svg) ## `INSERT` execution Let's walk through the whole process of executing `INSERT INTO table VALUES (1, '11')`: - `VALUES (1, '11')` is stored as a `DataChunk` and passed to the insert executor. - The transaction adds the `DataChunk` to its memory-table. - Whenever the mem-table is flushed (exceeding threshold/commit), a new `DataChunk` is produced. - The `DataChunk` is passed to `RowsetBuilder`. - When the `RowsetBuilder` finishes, it produces an `EncodedRowset`, which is persisted to disk. ![types and traits](images/06-storage-basics-02.svg)