Crates.io | tantivy-columnar |
lib.rs | tantivy-columnar |
version | 0.3.0 |
source | src |
created_at | 2023-06-09 09:12:11.984241 |
updated_at | 2024-04-12 04:48:49.545044 |
description | column oriented storage for tantivy |
homepage | https://github.com/quickwit-oss/tantivy |
repository | https://github.com/quickwit-oss/tantivy |
max_upload_size | |
id | 886116 |
size | 450,999 |
This crate describes the columnar format used in tantivy.
This format is special in the following way: it supports columns of different types (str, u64, i64, f64)
and different cardinality (required, optional, multivalued).
Users can create a columnar by inserting rows into a ColumnarWriter,
and serializing it into a Write object.
Nothing prevents a user from recording values of different types to the same column_name.
In that case, tantivy-columnar's behavior is as follows:
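Values whose types cannot be coerced into one another (for instance a str and a number) are kept apart: tantivy-columnar will simply emit several columns associated with the given column_name.
Numerical values of different types (u64, i64, f64) end up in a single column: tantivy-columnar will pick the first type that can represent the set of appended values, with the following priority order (i64, u64, f64).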
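The numerical part of this rule can be sketched as follows. This is a minimal illustration, not the crate's implementation: the `Number` and `NumericalType` enums and the `coerce_numerical_type` function are hypothetical names introduced for this example only.

```rust
/// A numerical value as a user might record it (hypothetical type for this sketch).
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Number {
    I64(i64),
    U64(u64),
    F64(f64),
}

/// The candidate column types, listed in priority order (hypothetical type).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum NumericalType {
    I64,
    U64,
    F64,
}

/// True if the value can be stored in an i64 column without loss.
fn fits_i64(value: Number) -> bool {
    match value {
        Number::I64(_) => true,
        Number::U64(v) => v <= i64::MAX as u64,
        Number::F64(_) => false,
    }
}

/// True if the value can be stored in a u64 column without loss.
fn fits_u64(value: Number) -> bool {
    match value {
        Number::I64(v) => v >= 0,
        Number::U64(_) => true,
        Number::F64(_) => false,
    }
}

/// Picks the first type, in the order (i64, u64, f64), that can hold every value.
/// In this sketch, f64 is simply the fallback when neither integer type fits.
pub fn coerce_numerical_type(values: &[Number]) -> NumericalType {
    if values.iter().copied().all(fits_i64) {
        NumericalType::I64
    } else if values.iter().copied().all(fits_u64) {
        NumericalType::U64
    } else {
        NumericalType::F64
    }
}
```

With this sketch, a column mixing small unsigned values and a rare negative value stays i64, while a column containing a value above `i64::MAX` and no negative value becomes u64.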
i64 is picked over u64 as it is likely to yield fewer changes of type. Most use cases strictly requiring u64 show the
restriction on about 50% of their values (e.g. a 64-bit hash), so the need for u64 shows up quickly. On the other hand, a lot of use cases can show rare negative values.
This columnar format may have more than one column (with different types) associated with the same column_name
(see the coercion rules above).
The (column_name, column_type) couple however uniquely identifies a column.
That couple is serialized as a column_key. The format of that key is:
[column_name][ZERO_BYTE][column_type_header: u8]
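As an illustration of that layout, the sketch below builds such a key. The function name `column_key` is hypothetical, and the numeric value used for `column_type_header` is left to the caller, since the actual type codes are not given here.

```rust
/// Builds `[column_name][ZERO_BYTE][column_type_header: u8]`.
///
/// Returns `None` when the column name contains the zero byte, which would
/// make the key ambiguous. The header value is whatever code the caller
/// associates with the column type (an assumption of this sketch).
pub fn column_key(column_name: &str, column_type_header: u8) -> Option<Vec<u8>> {
    if column_name.as_bytes().contains(&0u8) {
        return None;
    }
    let mut key = Vec::with_capacity(column_name.len() + 2);
    key.extend_from_slice(column_name.as_bytes());
    key.push(0u8); // ZERO_BYTE separator
    key.push(column_type_header);
    Some(key)
}
```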
COLUMNAR :=
    [COLUMNAR_DATA]
    [COLUMNAR_KEY_TO_DATA_INDEX]
    [COLUMNAR_FOOTER];

# Columns are sorted by their column key.
COLUMNAR_DATA :=
    [COLUMN_DATA]+;

COLUMNAR_FOOTER := [RANGE_SSTABLE_BYTES_LEN: 8 bytes little endian];
The columnar file starts with the actual column data, concatenated one after the other, sorted by column key.
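Given the grammar above, a reader can find the three sections by starting from the end of the file. The sketch below assumes the footer is exactly the 8-byte little-endian length stated in the grammar and nothing else. `ColumnarLayout` and `split_columnar` are hypothetical names, not part of the crate's API.

```rust
use std::ops::Range;

/// Byte ranges of the three sections of a columnar file (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub struct ColumnarLayout {
    pub column_data: Range<usize>,
    pub key_to_data_index: Range<usize>,
    pub footer: Range<usize>,
}

/// Footer length as given by the grammar: one u64, little endian.
const FOOTER_LEN: usize = 8;

/// Splits a columnar file into its sections, or returns `None` if the
/// buffer is too short or the recorded index length does not fit.
pub fn split_columnar(bytes: &[u8]) -> Option<ColumnarLayout> {
    let footer_start = bytes.len().checked_sub(FOOTER_LEN)?;
    let mut footer = [0u8; FOOTER_LEN];
    footer.copy_from_slice(&bytes[footer_start..]);
    // The footer stores the byte length of the key-to-data index (the sstable).
    let sstable_len = u64::from_le_bytes(footer);
    if sstable_len > footer_start as u64 {
        return None;
    }
    let sstable_start = footer_start - sstable_len as usize;
    Some(ColumnarLayout {
        column_data: 0..sstable_start,
        key_to_data_index: sstable_start..footer_start,
        footer: footer_start..bytes.len(),
    })
}
```

Reading the footer first means only the index needs to be loaded before seeking to a specific column's bytes.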
An sstable associates each (column_name, column_cardinality, column_type) with a range of bytes.
Column names may not contain the zero byte \0.
Listing all columns associated with column_name can therefore
be done by listing all keys prefixed by
[column_name][ZERO_BYTE]
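The prefix listing can be sketched with any sorted map. Here a `BTreeMap` stands in for the sstable, which is an assumption made purely for illustration. The function name `columns_for_name` is hypothetical.

```rust
use std::collections::BTreeMap;
use std::ops::Range;

/// Lists every (column_key, byte range) entry whose key starts with
/// `[column_name][ZERO_BYTE]`. A BTreeMap replaces the sstable in this sketch.
pub fn columns_for_name<'a>(
    index: &'a BTreeMap<Vec<u8>, Range<u64>>,
    column_name: &str,
) -> Vec<(&'a [u8], Range<u64>)> {
    let mut prefix = column_name.as_bytes().to_vec();
    prefix.push(0u8); // ZERO_BYTE
    index
        // Keys are sorted, so all matches are contiguous starting at the prefix.
        .range(prefix.clone()..)
        .take_while(|(key, _)| key.starts_with(&prefix))
        .map(|(key, range)| (key.as_slice(), range.clone()))
        .collect()
}
```

Because column names cannot contain the zero byte, the prefix for "a" never matches a key belonging to a column named "ab".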
The associated range of bytes refers to a range of bytes in the column data.
This crate exposes a columnar format for tantivy. This format is described in README.md.
The crate introduces the following concepts.
Columnar is an equivalent of a dataframe.
It maps column_key to Column.
A Column<T> associates a RowId (u32) to any number of values.
This is made possible by wrapping a ColumnIndex and a ColumnValue object.
The ColumnValue<T> represents a mapping that associates each RowId to exactly one single value.
The ColumnIndex then maps each RowId to a set of RowId in the ColumnValue.
For optimization and compression purposes, the ColumnIndex has three
possible representations, one for each cardinality.
Required: all RowIds have exactly one value. The ColumnIndex is the trivial mapping.
Optional: all RowIds can have at most one value. The ColumnIndex is the mapping ColumnRowId -> Option<ColumnValueRowId>.
Multivalued: all RowIds can have any number of values. The ColumnIndex maps each RowId to a range of RowIds in the ColumnValue.
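The three representations and the way a column wraps them can be sketched as below. This is a simplified model under stated assumptions, not the crate's actual types: `ColumnIndexSketch` and `ColumnSketch` are hypothetical names, and plain `Vec`s replace whatever compressed encodings the crate uses.

```rust
use std::ops::Range;

pub type RowId = u32;

/// Simplified model of the three index representations (hypothetical type).
pub enum ColumnIndexSketch {
    /// Exactly one value per row: row i maps to value i.
    Full,
    /// At most one value per row: row i maps to an optional value row id.
    Optional(Vec<Option<RowId>>),
    /// Any number of values per row: row i maps to the value range
    /// start_offsets[i]..start_offsets[i + 1] (length is num_rows + 1).
    Multivalued(Vec<RowId>),
}

impl ColumnIndexSketch {
    /// Returns the range of value row ids associated with `row_id`.
    pub fn value_row_ids(&self, row_id: RowId) -> Range<RowId> {
        match self {
            ColumnIndexSketch::Full => row_id..row_id + 1,
            ColumnIndexSketch::Optional(mapping) => match mapping[row_id as usize] {
                Some(value_row_id) => value_row_id..value_row_id + 1,
                None => 0..0,
            },
            ColumnIndexSketch::Multivalued(start_offsets) => {
                let i = row_id as usize;
                start_offsets[i]..start_offsets[i + 1]
            }
        }
    }
}

/// A column wraps an index and a dense array of values (hypothetical type).
pub struct ColumnSketch<T> {
    pub index: ColumnIndexSketch,
    pub values: Vec<T>,
}

impl<T: Copy> ColumnSketch<T> {
    /// Collects every value associated with `row_id`.
    pub fn values_for_row(&self, row_id: RowId) -> Vec<T> {
        self.index
            .value_row_ids(row_id)
            .map(|value_row_id| self.values[value_row_id as usize])
            .collect()
    }
}
```

Keeping the values dense and pushing all cardinality handling into the index means the value storage is identical for the three cases.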
All these objects are implemented and unit tested independently in their own module: