Crates.io | tantivy-sstable |
lib.rs | tantivy-sstable |
version | |
source | src |
created_at | 2023-06-09 09:05:04.310514+00 |
updated_at | 2025-04-09 08:59:31.279905+00 |
description | sstables for tantivy |
homepage | https://github.com/quickwit-oss/tantivy |
repository | https://github.com/quickwit-oss/tantivy |
max_upload_size | |
id | 886106 |
Cargo.toml error: | TOML parse error at line 17, column 1 | 17 | autolib = false | ^^^^^^^ unknown field `autolib`, expected one of `name`, `version`, `edition`, `authors`, `description`, `readme`, `license`, `repository`, `homepage`, `documentation`, `build`, `resolver`, `links`, `default-run`, `default_dash_run`, `rust-version`, `rust_dash_version`, `rust_version`, `license-file`, `license_dash_file`, `license_file`, `licenseFile`, `license_capital_file`, `forced-target`, `forced_dash_target`, `autobins`, `autotests`, `autoexamples`, `autobenches`, `publish`, `metadata`, `keywords`, `categories`, `exclude`, `include` |
size | 0 |
The tantivy-sstable
crate is yet another sstable crate.
It has been designed to be used in quickwit
:
The benefit compared to the fst crate is locality. Searching a key in the fst crate requires downloading the entire dictionary.
Once the sstable index is downloaded, running a get
in the sstable
crate only requires a single fetch.
Right now, the block index and the default block size have been thought for quickwit, and the performance of a get is very bad.
SSTable stands for Sorted String Table. Strings have to be insert in sorted order.
That sorted order is used in different ways:
Overview of the SSTable format. Unless noted otherwise, numbers are little-endian.
+-------+-------+-----+--------+
| Block | Block | ... | Footer |
+-------+-------+-----+--------+
|----( # of blocks)---|
SSTBlock
): list of independent block, terminated by a single empty block.SSTFooter
)+----------+----------+--------+-------+-------+-----+
| BlockLen | Compress | Values | Delta | Delta | ... |
+----------+----------+--------+-------+-------+-----+
| |----( # of deltas)---|
|------(maybe compressed)------|
+---------+--------+
| KeepAdd | Suffix |
+---------+--------+
KeepAdd can be represented in two different representation, a very compact 1byte one which is enough for most usage, and a longer variable-len one when required
When keep < 16 and add < 16
+-----+------+
| Add | Keep |
+-----+------+
Otherwise:
+------+------+-----+
| 0x01 | Keep | Add |
+------+------+-----+
Add(VInt): number of bytes to push
Keep(VInt): number of bytes to pop
Note: as the SSTable does not support redundant keys, there is no ambiguity between both representation. Add is always guaranteed to be non-zero, except for the very first key of an SSTable, where Keep is guaranteed to be zero.
+-----+----------------+-------------+-------------+---------+---------+
| Fst | BlockAddrStore | StoreOffset | IndexOffset | NumTerm | Version |
+-----+----------------+-------------+-------------+---------+---------+
Fst is in the format of tantivy_fst
+---------+-----------+-----------+-----+-----------+-----------+-----+ | MetaLen | BlockMeta | BlockMeta | ... | BlockData | BlockData | ... | +---------+-----------+-----------+-----+-----------+-----------+-----+ |---------(N blocks)----------|---------(N blocks)----------|
+--------+------------+--------------+------------+--------------+-------------------+-----------------+----------+ | Offset | RangeStart | FirstOrdinal | RangeSlope | OrdinalSlope | FirstOrdinalNBits | RangeStartNBits | BlockLen | +--------+------------+--------------+------------+--------------+-------------------+-----------------+----------+
+-----------------+-------------------+---------------+ | RangeStartDelta | FirstOrdinalDelta | FinalRangeEnd | +-----------------+-------------------+---------------+ |------(BlockLen repetitions)---------|
converting a BlockData of index Index and a BlockAddrBlockMetadata to an actual block address is done as follow: range_prediction := RangeStart + Index * RangeSlop; range_derivation := RangeStartDelta - (1 << (RangeStartNBits-1)); range_start := range_prediction + range_derivation
The same computation can be done for ordinal.
Note that range_derivation
can take negative value. RangeStartDelta
is just its translation to a positive range.
The format used for the index is meant to be compact, however it has a constant cost of around 70 bytes, which isn't negligible for a table containing very few keys. To limit the impact of that constant cost, single block sstable omit the Fst and BlockAddrStore from their index. Instead a block with first ordinal of 0, range start of 0 and range end of IndexOffset is implicitly used for every operations.