Experimental results show that single-threaded query processing with Seismic reaches sub-millisecond per-query latency on various sparse embeddings of the MSMARCO dataset while maintaining high recall. The results indicate that Seismic is one to two orders of magnitude faster than state-of-the-art inverted index-based solutions, and that it outperforms the winning (graph-based) submissions to the BigANN Challenge by a significant margin. See the paper [[1](#bib)] for more details.

## Experimental Results

We report here a comparison with other state-of-the-art indexes for sparse vectors. See the paper [[1](#bib)] for a more detailed experimental evaluation.
### How to Replicate the Experiments

To run the experiments with Seismic, we need to compile the binary executables using:

```bash
RUSTFLAGS="-C target-cpu=native" cargo build --release
```

This command produces three executables in the `/target/release/` directory: `build_inverted_index`, `perf_inverted_index`, and `generate_groundtruth`.

The `build_inverted_index` executable constructs an inverted index for a dataset. Both dataset and query files are stored in an internal binary format; refer to the [Python scripts](#scripts) section for a script to convert a dataset from JSON format. Indexing involves several parameters that regulate space/time trade-offs:

- `--n-postings`: regulates the size of the posting lists, i.e., the average number of postings stored per posting list.
- `--summary-energy`: controls the size of the summaries; each summary preserves this fraction of the overall energy.
- `--centroid-fraction`: determines the number of centroids built for each posting list, capped at this fraction of the posting list length.

For Splade on MSMARCO, good choices are `--n-postings 3500`, `--summary-energy 0.4`, and `--centroid-fraction 0.1`. The following command creates a Seismic index serialized in the file `splade.bin.3500_0.4_0.1`:

```bash
./target/release/build_inverted_index -i splade.bin -o splade.bin.3500_0.4_0.1 --centroid-fraction 0.1 --summary-energy 0.4 --n-postings 3500
```

To execute a set of queries, use the `perf_inverted_index` executable. Two parameters, `--query-cut` and `--heap-factor`, trade efficiency for accuracy:

- `--query-cut`: the search algorithm considers only the top `query_cut` components of the query.
- `--heap-factor`: the search algorithm skips a block whose estimated dot product is smaller than `heap_factor` times the smallest dot product of the top-k results in the current heap.
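To make the roles of these two parameters concrete, here is a small illustrative sketch in Python (the function names and the exact pruning rule are simplifications for exposition, not Seismic's actual code):

```python
def top_query_components(query: dict[int, float], query_cut: int) -> list[tuple[int, float]]:
    """Mimic --query-cut: keep only the query_cut components of the
    query with the largest magnitude."""
    return sorted(query.items(), key=lambda kv: abs(kv[1]), reverse=True)[:query_cut]


def should_skip_block(estimated_dot: float, heap_min: float, heap_factor: float) -> bool:
    """Mimic --heap-factor: skip a block whose estimated dot product
    cannot beat heap_factor times the current top-k threshold. A
    heap_factor below 1 compensates for summaries underestimating the
    true dot product, trading speed for accuracy."""
    return estimated_dot <= heap_factor * heap_min
```

With `heap_factor` close to 1, pruning is most aggressive (fastest, lowest accuracy); lowering it forces more blocks to be evaluated, which is the trade-off visible in the timing tables below.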
The following command exemplifies this:

```bash
./target/release/perf_inverted_index -i splade.bin.3500_0.4_0.1 -q splade_queries.bin -o results.tsv --query-cut 5 --heap-factor 0.7
```

The query dataset is in the internal binary format; refer again to the [Python scripts](#scripts) section for a script to convert a dataset from JSON format.

The executable prints the average running time per query. Queries are executed in single-thread mode. To enable multithreading, modify the Rust code by replacing the iteration over the queries from `queries.iter().take(n_queries).enumerate()` to `queries.par_iter().take(n_queries).enumerate()`.

The results are written to the file `results.tsv`. For each query there are `k` lines, one per result, each with the following format:

```text
query_id\tdocument_id\tresult_rank\tdot_product
```

Here, `query_id` is a progressive identifier of the query, `document_id` is the identifier of the document in the indexed dataset, and `result_rank` is the rank of the document in the ordering by `dot_product` with the query.

To evaluate the accuracy of the retrieved results against an already computed ground truth, use the Python script `scripts/accuracy.py`:

```bash
python3 scripts/accuracy.py groundtruth.tsv results.tsv
```

This outputs the recall percentage. The ground truth for a dataset can be computed with `generate_groundtruth` as follows:

```bash
./target/release/generate_groundtruth -i splade.bin -q splade_queries.bin -o groundtruth.tsv
```

The exact top-10 results of each query are written to the file `groundtruth.tsv` in the format described above. Even with multithreading enabled, this may take a considerable amount of time, because the brute-force exact query algorithm scans the entire dataset for each query.

### Seismic Parameters

The table below reports the building parameters we used for the different datasets.
| Dataset | `--n-postings` | `--centroid-fraction` | `--summary-energy` |
|-----------------|-----------------:|-----------------:|----------------:|
| MSMARCO-Splade | 4000 | 0.1 | 0.4 |
| MSMARCO-Effsplade | 4000 | 0.1 | 0.4 |
| MSMARCO-UniCOIL-T5 | 4000 | 0.1 | 0.4 |
| NQ-Splade | 3500 | 0.15 | 0.5 |

The table below reports the query parameters we used for the different datasets at various levels of accuracy. Results may change slightly if you re-create the indexes, due to the random selection of Seismic's centroids.

**Splade on MSMARCO**

| $hf$ | $q_c$ | Time \[ $\mu s$ \] | Accuracy@10 |
|-----------------|-----------------:|-----------------:|----------------:|
| 0.9 | 3 | 187 | 90.49 |
| 0.9 | 4 | 206 | 92.27 |
| 0.9 | 5 | 222 | 93.13 |
| 0.9 | 8 | 269 | 94.07 |
| 0.8 | 5 | 303 | 95.69 |
| 0.8 | 6 | 348 | 96.11 |
| 0.7 | 6 | 531 | 97.17 |

**E-Splade on MSMARCO**

| $hf$ | $q_c$ | Time \[ $\mu s$ \] | Accuracy@10 |
|-----------------|-----------------:|-----------------:|----------------:|
| 1 | 3 | 222 | 90.99 |
| 1 | 4 | 239 | 93.26 |
| 1 | 6 | 256 | 94.17 |
| 0.9 | 4 | 376 | 95.95 |
| 0.9 | 5 | 383 | 96.53 |
| 0.8 | 5 | 581 | 97.47 |

**UniCOIL-T5 on MSMARCO**

| $hf$ | $q_c$ | Time \[ $\mu s$ \] | Accuracy@10 |
|-----------------|-----------------:|-----------------:|----------------:|
| 1 | 3 | 115 | 90.04 |
| 1 | 4 | 123 | 91.33 |
| 1 | 6 | 133 | 92.07 |
| 0.9 | 3 | 168 | 94.03 |
| 0.9 | 4 | 180 | 95.13 |
| 0.8 | 4 | 268 | 96.76 |
| 0.8 | 5 | 280 | 97.19 |

**Splade on NQ**

| $hf$ | $q_c$ | Time \[ $\mu s$ \] | Accuracy@10 |
|-----------------|-----------------:|-----------------:|----------------:|
| 1 | 3 | 195 | 91.25 |
| 1 | 6 | 211 | 92.23 |
| 0.9 | 3 | 240 | 93.24 |
| 0.9 | 4 | 266 | 95.17 |
| 0.8 | 4 | 286 | 96.26 |
| 0.8 | 5 | 362 | 97.18 |

We provide a script to explore the search parameters
given a trained index: `scripts/grid_search_only_accuracy.sh`. You can use it as follows:

```bash
index_path=""
results_file_path=""
queries_path=""
gt_path=""

bash scripts/grid_search_only_accuracy.sh $index_path $results_file_path $queries_path $gt_path
```

The script writes the results of the grid search to `results_file_path`. To get the fastest configuration at each accuracy cut, simply run:

```bash
python scripts/extract_results.py --file-path $results_file_path
```

## Python Scripts

### Download the Datasets

The embeddings in `jsonl` format used in our experiments, together with the query representations, can be downloaded from this Hugging Face [repository](https://huggingface.co/collections/tuskanny/seismic-datasets-6610108d39c0f2299f20fc9b). As an example, the Splade embeddings for MSMARCO can be downloaded and extracted by running the following commands:

```bash
wget https://huggingface.co/datasets/tuskanny/seismic-msmarco-splade/resolve/main/documents.tar.gz?download=true -O documents.tar.gz

tar -xvzf documents.tar.gz
```

or by using the Hugging Face dataset download [tool](https://huggingface.co/docs/hub/en/datasets-downloading).

### Convert the Data

Documents and queries should have the following format. Each line should be a JSON-formatted string with the following fields:

- `id`: the ID of the document, as an integer.
- `content`: the original content of the document, as a string. This field is optional.
- `vector`: a dictionary where each key is a token and the corresponding value is its score, e.g., `{"dog": 2.45}`. This is the standard output format of several libraries for training sparse models, such as [`learned-sparse-retrieval`](https://github.com/thongnt99/learned-sparse-retrieval).

The script `convert_json_to_inner_format.py` converts files in this format into the `seismic` inner format.
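For reference, a minimal file in this format can be generated with a few lines of Python (the token scores and the file name `documents.jsonl` are made up for illustration):

```python
import json

# Two toy documents in the expected JSONL format; "content" is optional.
docs = [
    {"id": 0, "content": "a dog playing fetch", "vector": {"dog": 2.45, "fetch": 1.2}},
    {"id": 1, "vector": {"cat": 1.7, "pet": 0.9}},
]

with open("documents.jsonl", "w") as f:
    for doc in docs:
        f.write(json.dumps(doc) + "\n")
```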
```bash
python scripts/convert_json_to_inner_format.py --document-path /path/to/document.jsonl --queries-path /path/to/queries.jsonl --output-dir /path/to/output
```

This will generate a `data` directory at `/path/to/output`, containing the binary files `documents.bin` and `queries.bin`. If you download the NQ dataset from the Hugging Face repo, you need to specify `--input-format nq`, as it uses a slightly different format.

## Using the Rust Code

To incorporate the Seismic library into your Rust project, navigate to your project directory and run the following Cargo command:

```bash
cargo add seismic
```

This command adds the Seismic library to your project.

#### Creating a Toy Dataset

Let's create a toy dataset comprising vectors with `f32` values. Next, we'll convert this dataset to use half-precision floating points ([`half::f16`](https://docs.rs/half/latest/half/)). Finally, we'll check the number of vectors, the dimensionality, and the number of non-zero components of the dataset.

```rust
use seismic::SparseDataset;
use half::f16;

let data = vec![
    (vec![0, 2, 4],    vec![1.0, 2.0, 3.0]),
    (vec![1, 3],       vec![4.0, 5.0]),
    (vec![0, 1, 2, 3], vec![1.0, 2.0, 3.0, 4.0]),
];

let dataset: SparseDataset