Crates.io | tantivy-object-store |
lib.rs | tantivy-object-store |
version | 0.1.0 |
source | src |
created_at | 2023-08-14 14:03:05.854479 |
updated_at | 2023-08-14 14:03:05.854479 |
description | A tantivy Directory implementation against object stores (S3, GCS, etc.) |
homepage | |
repository | https://github.com/lancedb/tantivy-object-store |
max_upload_size | |
id | 944194 |
size | 95,711 |
This repo contains an implementation of a tantivy::directory::Directory using an object_store::ObjectStore.
This implementation supports both read and write, but does not support locking or file watch. The index building process is responsible for making sure that there are no concurrent index writers.
A few notable behavior differences from tantivy's directory implementations:
Tantivy uses a file called meta.json, which lists all the files that make up the index, effectively keeping track of a snapshot of the index. However, vanilla tantivy doesn't support versioning, meaning meta.json is overwritten every time the index is updated. This implementation allows the caller to set a read_version and a write_version. These version numbers are appended to the end of the file name when the caller calls atomic_read or atomic_write.
When creating an ObjectStoreDirectory, the user may set read_version and write_version. read_version is used when the user calls atomic_read: instead of reading meta.json, we try to read meta.json.{read_version}. Likewise, when the user calls atomic_write, we try to write meta.json.{write_version}.
NOTE: write_version takes precedence over read_version. This means that after the first write, atomic_read will read from meta.json.{write_version}, NOT meta.json.{read_version}. This is needed because tantivy modifies the meta.json file several times during indexing, so the copy-on-write (CoW) implementation here needs to have read-after-write consistency.
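The version rules above can be sketched in a few lines of standard-library Rust. This is only an illustration of the naming logic as described, not the crate's actual code: the MetaVersions type and its read_path and write_path methods are hypothetical names, and falling back to the plain file name when a version is unset is an assumption.

```rust
use std::sync::Mutex;

/// Hypothetical helper (not part of the crate): decides which versioned
/// meta file an atomic read or write should target.
struct MetaVersions {
    read_version: Option<u64>,
    write_version: Option<u64>,
    /// Becomes true after the first write; guarded by a lock so the
    /// helper can be shared between threads.
    written: Mutex<bool>,
}

impl MetaVersions {
    fn new(read_version: Option<u64>, write_version: Option<u64>) -> Self {
        MetaVersions {
            read_version,
            write_version,
            written: Mutex::new(false),
        }
    }

    /// Appends ".{version}" to the file name.
    /// Assumption: with no version set, the plain name is used.
    fn versioned(name: &str, version: Option<u64>) -> String {
        match version {
            Some(v) => format!("{}.{}", name, v),
            None => name.to_string(),
        }
    }

    /// Path used by a write; also records that a write has happened.
    fn write_path(&self, name: &str) -> String {
        *self.written.lock().unwrap() = true;
        Self::versioned(name, self.write_version)
    }

    /// Path used by a read. After the first write, the write version
    /// takes precedence so that reads observe earlier writes.
    fn read_path(&self, name: &str) -> String {
        let written = *self.written.lock().unwrap();
        if written && self.write_version.is_some() {
            Self::versioned(name, self.write_version)
        } else {
            Self::versioned(name, self.read_version)
        }
    }
}

fn main() {
    let versions = MetaVersions::new(Some(3), Some(4));
    println!("{}", versions.read_path("meta.json"));
    println!("{}", versions.write_path("meta.json"));
    println!("{}", versions.read_path("meta.json"));
}
```

With read_version 3 and write_version 4, the first read resolves to meta.json.3, the write goes to meta.json.4, and every later read also resolves to meta.json.4.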
This implementation does not support reloading. If a watch callback is registered, the callback will never be called. Users need to handle reloading via other mechanisms for now.
It may be possible to use something like the object store's native change notifications to trigger a reload, but that is future work.
Since tantivy attempts to garbage collect and merge index files during indexing, we had to change the delete operation to a no-op. This is because we don't want tantivy to garbage collect files from past versions, as those files may be in use by other readers. We will implement a garbage collection process separately.
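The effect of a no-op delete can be illustrated with a small in-memory stand-in. This is a sketch only: KeepAllStore and its methods are hypothetical and do not mirror the crate's or tantivy's API.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory stand-in for an object store (not the crate's type).
struct KeepAllStore {
    files: HashMap<String, Vec<u8>>,
}

impl KeepAllStore {
    fn new() -> Self {
        KeepAllStore { files: HashMap::new() }
    }

    /// Stores the bytes under the given path, replacing any earlier content.
    fn write(&mut self, path: &str, data: &[u8]) {
        self.files.insert(path.to_string(), data.to_vec());
    }

    /// Deliberately does nothing: files from past versions may still be in
    /// use by other readers, so cleanup is left to a separate process.
    fn delete(&mut self, _path: &str) {}

    fn exists(&self, path: &str) -> bool {
        self.files.contains_key(path)
    }
}

fn main() {
    let mut store = KeepAllStore::new();
    store.write("segment.idx", b"data");
    store.delete("segment.idx");
    // The file survives the delete call.
    println!("still present: {}", store.exists("segment.idx"));
}
```

The trade-off is that storage grows until a separate cleanup runs, in exchange for readers of older versions never losing files underneath them.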
This implementation contains a tokio::runtime::Runtime for running the IO jobs. This means that when calling functions from this implementation from inside another tokio runtime, the caller should always use tokio::task::spawn_blocking so the task can be scheduled on a thread without a tokio runtime. (This is needed because nesting tokio runtimes causes a panic.)
This implementation is concurrency safe within a single instance, as the atomic_read and atomic_write mechanism is guarded by a lock object held in the returned trait object.
TBD