tantivy-uffd

Crates.io	tantivy-uffd
lib.rs	tantivy-uffd
version	0.1.1
created_at	2024-02-25 01:55:26.612634+00
updated_at	2024-06-19 17:59:58.782697+00
description	Query a remote tantivy index with HTTP range requests.
repository	https://github.com/ellenhp/tantivy-uffd/
id	1152112
size	48,326

Ellen Poe (ellenhp)

README

Remote Directories for Tantivy

tantivy-uffd is a userfaultfd-based remote directory implementation for the tantivy search engine library. With this crate, you can run tantivy without a local copy of the search index, fetching pieces on demand via HTTP range requests instead. This is similar in concept to tantivy-wasm, but doesn't run in the browser. It's intended for situations where you want to co-locate a search index with multiple tantivy serving instances, or where you'd like to take extreme cost-saving measures and host a static index on object storage.

Requirements

This crate should work on Linux >= 5.7. The endpoint you point tantivy-uffd at must support HTTP GET and HEAD requests, and must support range requests. I'm not actually sure what will happen if it doesn't support range requests. It's very possible that you'll get an OOM because tantivy-uffd fails to notice and accidentally downloads the entire file. The user running a program that uses tantivy-uffd needs read/write access to /dev/userfaultfd. Without it, you will get a permissions error when the directory is opened.
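The permission requirement above can be checked up front, before a directory is opened. The following is a minimal sketch using only the standard library; the helper name `check_rw_access` is hypothetical and not part of tantivy-uffd. It only attempts to open the device node for reading and writing, and does not perform any userfaultfd setup.

```rust
use std::fs::OpenOptions;
use std::io;
use std::path::Path;

/// Hypothetical helper: tries to open `path` for reading and writing,
/// then closes it again immediately. Not part of tantivy-uffd.
fn check_rw_access(path: &Path) -> io::Result<()> {
    OpenOptions::new()
        .read(true)
        .write(true)
        .open(path)
        .map(|_file| ()) // the handle is dropped (closed) right here
}

fn main() {
    let device = Path::new("/dev/userfaultfd");
    match check_rw_access(device) {
        Ok(()) => println!("{} is accessible", device.display()),
        Err(err) => eprintln!("cannot open {}: {err}", device.display()),
    }
}
```

Running a check like this at startup turns a failure deep inside directory setup into an early, readable error message.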

Caveats

There's a lot of unsafe code in this crate, and if a remote directory is ever dropped you should assume that there will be undefined behavior. It should be possible to fix this, but I haven't yet. Even if you use it exactly as intended, there's probably still a fair amount of UB lurking in the shadows, as this crate was originally written as a proof of concept for the Airmail project, without a priori knowledge that it would even work.

Optimizations

Enabling the quickwit feature on tantivy will improve remote directory performance substantially, as will fetching documents in parallel with an appropriate number of spawn_blocking operations. A good starting point for CHUNK_SIZE can be found by estimating the bandwidth-delay product to your index host. I use 1-2 MiB for Airmail. I would consider reducing this for an fst-based index.

If you search against a remote index with a large disjunction, it can be helpful to launch a search for each clause of the disjunction in parallel prior to searching for the main query. This can result in a lot of unnecessary traffic from your index. However, userfaultfd page faults pause the entire thread, and tantivy performs many dependent reads, so the per-clause searches start a bunch of threads that essentially perform readahead for the main thread. This can speed things up a lot.
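As a rough illustration of the bandwidth-delay estimate for CHUNK_SIZE mentioned above, here is a small sketch. The function names and the example link figures (100 Mbit/s, 100 ms round trip) are assumptions made for illustration. Rounding up to a power of two, with a floor of one 4 KiB page, is also just a convenient choice here, not something the crate is known to require.

```rust
/// Bandwidth-delay product in bytes: bandwidth (bits/s) * round-trip time (s) / 8.
fn bandwidth_delay_product_bytes(bandwidth_bits_per_sec: u64, rtt_millis: u64) -> u64 {
    bandwidth_bits_per_sec * rtt_millis / 1000 / 8
}

/// Rounds the estimate up to a power of two, with a floor of one 4 KiB page.
/// (Assumption for illustration; the crate may accept other sizes.)
fn suggested_chunk_size(bdp_bytes: u64) -> u64 {
    bdp_bytes.max(4096).next_power_of_two()
}

fn main() {
    // Hypothetical link: 100 Mbit/s with a 100 ms round trip to the index host.
    let bdp = bandwidth_delay_product_bytes(100_000_000, 100);
    let chunk = suggested_chunk_size(bdp);
    println!("BDP = {bdp} bytes, suggested chunk size = {chunk} bytes");
}
```

With these example numbers the product is 1,250,000 bytes, which rounds up to 2 MiB. Treat the result as a starting point to measure against, not a final setting.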
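The disjunction warm-up described above could be sketched roughly as follows. This is not code from tantivy-uffd or Airmail. `search_with_warmup` is a hypothetical helper, the `search` closure stands in for whatever runs a query against your index, and plain scoped threads are used instead of spawn_blocking to keep the example free of dependencies.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

/// Hypothetical helper: runs `search` once per clause on its own thread and
/// discards those results, while the full query runs on the calling thread.
fn search_with_warmup<F, R>(clauses: &[&str], full_query: &str, search: F) -> R
where
    F: Fn(&str) -> R + Sync,
{
    thread::scope(|scope| {
        for &clause in clauses {
            let search = &search;
            // Each warm-up thread takes its own page faults, pulling chunks
            // into the cache that the main query is likely to need.
            scope.spawn(move || {
                let _ = search(clause);
            });
        }
        // The scope joins all warm-up threads before this result is returned.
        search(full_query)
    })
}

fn main() {
    let calls = AtomicUsize::new(0);
    // Stand-in for a real search: counts invocations and returns a dummy value.
    let fake_search = |query: &str| {
        calls.fetch_add(1, Ordering::SeqCst);
        query.len()
    };
    let result = search_with_warmup(&["alpha", "beta"], "alpha OR beta", fake_search);
    println!("result = {result}, searches run = {}", calls.load(Ordering::SeqCst));
}
```

One design caveat: a scoped-thread version only returns once every warm-up search has finished. If a clause can be much slower than the full query, detached threads or spawn_blocking tasks may be a better fit.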

I would also (counterintuitively) recommend configuring user-interface code to aggressively send queries as they're being typed, because doing so warms the cache and reduces perceived latency. The Airmail demo has a 100 ms debounce on partial query requests (whose results are discarded) and a 500 ms debounce on a final query whose results are displayed to the user. If you have the CPU cycles to spare, this works very well.

Alternatives

You can implement your own remote directory by fetching slices on demand in FileHandle::read_bytes, but you'll end up needing to cache chunks and perform a lot of copies of those chunks when you receive a read_bytes request spanning chunk boundaries. This ended up being an unacceptable performance cost for Airmail.
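To make the copying cost concrete, here is a simplified sketch of serving a read from a chunk cache. It does not use tantivy's actual FileHandle trait. `DEMO_CHUNK_SIZE`, `chunks_for_range` and `read_from_chunks` are hypothetical names, the cache is just a slice of already-fetched chunks, and the requested range is assumed to lie within the cached data.

```rust
use std::ops::Range;

/// Tiny chunk size, chosen only to keep the example readable.
const DEMO_CHUNK_SIZE: usize = 4;

/// Indices of the chunks that a byte range touches.
fn chunks_for_range(range: &Range<usize>) -> Range<usize> {
    if range.is_empty() {
        return 0..0;
    }
    (range.start / DEMO_CHUNK_SIZE)..((range.end - 1) / DEMO_CHUNK_SIZE + 1)
}

/// Assembles the requested bytes from cached chunks. Every chunk the range
/// touches contributes one copy into the output buffer.
/// Assumes `range` lies within the cached data.
fn read_from_chunks(chunks: &[Vec<u8>], range: Range<usize>) -> Vec<u8> {
    let mut out = Vec::with_capacity(range.len());
    for idx in chunks_for_range(&range) {
        let chunk_start = idx * DEMO_CHUNK_SIZE;
        let lo = range.start.max(chunk_start) - chunk_start;
        let hi = range.end.min(chunk_start + chunks[idx].len()) - chunk_start;
        out.extend_from_slice(&chunks[idx][lo..hi]); // the per-chunk copy
    }
    out
}

fn main() {
    let chunks = vec![vec![0, 1, 2, 3], vec![4, 5, 6, 7], vec![8, 9]];
    // This read touches all three chunks, so it costs three copies.
    let bytes = read_from_chunks(&chunks, 2..9);
    println!("{bytes:?}");
}
```

A read that stays inside one chunk could in principle hand out a slice of the cached chunk, but any read that crosses a boundary has to be stitched together like this. That is the cost the paragraph above refers to.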

License

Dual MIT/Apache 2.0, at your option. This crate contains some code from tantivy, namely query_len.rs. Copyright notices, the authors file and original MIT license have been preserved.

Commit count: 4