| Crates.io | seismic |
| lib.rs | seismic |
| version | 0.2.1 |
| created_at | 2024-07-04 04:33:51.067436+00 |
| updated_at | 2025-03-05 09:02:24.897126+00 |
| description | Seismic is designed for effective and efficient KNN retrieval over learned sparse embeddings. |
| repository | https://github.com/TusKANNy/seismic |
| id | 1291310 |
| size | 1,457,549 |
Seismic is a highly efficient data structure for fast retrieval over learned sparse embeddings. Designed with scalability and performance in mind, Seismic makes querying sparse representations seamless.
To install Seismic, simply run:
pip install pyseismic-lsr
For performance optimizations, check out the detailed installation guide in docs/Installation.md.
Given a collection stored as a JSONL file (the expected format is described below), you can index it by running:
from seismic import SeismicIndex

json_input_file = ""  # Path to your JSONL data collection
index = SeismicIndex.build(json_input_file)
print("Number of documents: ", index.len)
print("Avg number of non-zero components: ", index.nnz / index.len)
print("Dimensionality of the vectors: ", index.dim)
index.print_space_usage_byte()
and then use the index to retrieve the top results for your queries:
import numpy as np

# Maximum token length, used to build a fixed-width NumPy unicode dtype
MAX_TOKEN_LEN = 30
string_type = f'U{MAX_TOKEN_LEN}'

# A sparse query: each token maps to its score
query = {"a": 3.5, "certain": 3.5, "query": 0.4}
queries_ids = np.array([0])
query_components = np.array(list(query.keys()), dtype=string_type)
query_values = np.array(list(query.values()), dtype=np.float32)

results = index.batch_search(
    queries_ids=queries_ids,
    query_components=query_components,
    query_values=query_values,
    k=10
)
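When evaluating an approximate index, it can help to compare its results against an exact baseline on a small sample. The sketch below is a brute-force reference in plain Python, not part of Seismic's API. It assumes relevance is scored by the inner product between the query and document vectors, and the function names and toy documents are hypothetical.

```python
import heapq

def dot(query, doc_vector):
    """Inner product of two sparse vectors stored as token -> score dicts."""
    # Iterate over the smaller dict and look tokens up in the larger one.
    if len(query) > len(doc_vector):
        query, doc_vector = doc_vector, query
    return sum(score * doc_vector[token]
               for token, score in query.items()
               if token in doc_vector)

def brute_force_top_k(query, documents, k=10):
    """Exact top-k by inner product.

    `documents` is a list of (doc_id, vector) pairs; the result is a list of
    (score, doc_id) tuples sorted by decreasing score.
    """
    scored = ((dot(query, vector), doc_id) for doc_id, vector in documents)
    return heapq.nlargest(k, scored)

# Toy collection (hypothetical data, for illustration only)
documents = [
    (0, {"a": 1.0, "query": 2.0}),
    (1, {"certain": 1.0}),
    (2, {"unrelated": 5.0}),
]
query = {"a": 3.5, "certain": 3.5, "query": 0.4}
top = brute_force_top_k(query, documents, k=2)
print(top)
```

A baseline like this only scales to small samples, but it is enough to estimate how many of the exact top-k documents an approximate search returns.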
The embeddings in JSONL format for several encoders and datasets, together with the corresponding query representations, can be downloaded from this Hugging Face repository.
As an example, the SPLADE embeddings for MS MARCO can be downloaded and extracted by running the following commands:
wget "https://huggingface.co/datasets/tuskanny/seismic-msmarco-splade/resolve/main/documents.tar.gz?download=true" -O documents.tar.gz
tar -xvzf documents.tar.gz
or by using the Hugging Face dataset download tool.
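If `wget` and `tar` are not available, the same two steps can be done from Python with the standard library alone. This is a minimal sketch: the URL is the one from the command above, while the helper names `download` and `extract` are hypothetical.

```python
import tarfile
import urllib.request
from pathlib import Path

URL = ("https://huggingface.co/datasets/tuskanny/seismic-msmarco-splade/"
       "resolve/main/documents.tar.gz?download=true")

def download(url, destination):
    """Download `url` to `destination` unless the file already exists."""
    destination = Path(destination)
    if not destination.exists():
        urllib.request.urlretrieve(url, destination)
    return destination

def extract(archive_path, output_dir="."):
    """Extract a .tar.gz archive and return the names of its members."""
    with tarfile.open(archive_path, "r:gz") as archive:
        names = archive.getnames()
        archive.extractall(output_dir)
    return names
```

Calling `extract(download(URL, "documents.tar.gz"))` would mirror the two shell commands. Note that the archive is large, so the download may take a while.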
Documents and queries should have the following format. Each line should be a JSON-formatted string with the following fields:
- id: must represent the ID of the document as an integer.
- content: the original content of the document, as a string. This field is optional.
- vector: a dictionary where each key represents a token, and its corresponding value is the score, e.g., {"dog": 2.45}.

This is the standard output format of several libraries to train sparse models, such as learned-sparse-retrieval.
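To make the format concrete, here is a minimal sketch that writes two records as JSONL and checks them against the field rules listed above. The helper names `validate_record` and `write_jsonl` and the sample documents are hypothetical and not part of Seismic.

```python
import json
import os
import tempfile

def validate_record(record):
    """Return a list of problems found in one decoded record (empty if valid)."""
    problems = []
    doc_id = record.get("id")
    if isinstance(doc_id, bool) or not isinstance(doc_id, int):
        problems.append("id must be an integer")
    if "content" in record and not isinstance(record["content"], str):
        problems.append("content must be a string when present")
    vector = record.get("vector")
    if not isinstance(vector, dict):
        problems.append("vector must be a dictionary")
    else:
        for token, score in vector.items():
            if isinstance(score, bool) or not isinstance(score, (int, float)):
                problems.append(f"score for token {token!r} must be a number")
    return problems

def write_jsonl(records, path):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")

# Sample records (hypothetical data); "content" is optional.
records = [
    {"id": 0, "content": "A dog barks.", "vector": {"dog": 2.45, "bark": 1.1}},
    {"id": 1, "vector": {"cat": 1.8}},
]

path = os.path.join(tempfile.mkdtemp(), "documents.jsonl")
write_jsonl(records, path)

# Read the file back and validate each line.
with open(path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]
for record in loaded:
    assert validate_record(record) == []
```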
The script convert_json_to_inner_format.py converts files in this format into the Seismic inner format:
python scripts/convert_json_to_inner_format.py --document-path /path/to/document.jsonl --queries-path /path/to/queries.jsonl --output-dir /path/to/output
This will generate a data directory at the /path/to/output path, with documents.bin and queries.bin binary files inside.
If you download the NQ dataset from the HuggingFace repo, you need to specify --input-format nq as it uses a slightly different format.
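Before converting or indexing a large collection, a quick pass over the JSONL file can catch surprises early. The sketch below computes the number of documents, the average number of non-zero components, and the number of distinct tokens. It assumes the standard format described above (not the NQ variant), and `collection_stats` is a hypothetical helper. The first two values should be comparable to `index.len` and `index.nnz / index.len`; whether the distinct-token count matches `index.dim` is an assumption.

```python
import json

def collection_stats(lines):
    """Compute simple statistics from an iterable of JSONL lines."""
    num_docs = 0
    total_nnz = 0
    vocabulary = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        vector = json.loads(line)["vector"]
        num_docs += 1
        # Count only components with a non-zero score.
        total_nnz += sum(1 for score in vector.values() if score != 0)
        vocabulary.update(vector)
    avg_nnz = total_nnz / num_docs if num_docs else 0.0
    return {"num_docs": num_docs, "avg_nnz": avg_nnz, "num_tokens": len(vocabulary)}

# Two sample lines (hypothetical data); pass an open file object in practice.
sample_lines = [
    '{"id": 0, "vector": {"dog": 2.45, "bark": 1.1}}',
    '{"id": 1, "vector": {"dog": 0.5}}',
]
stats = collection_stats(sample_lines)
print(stats)
```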
Check out our docs folder for more detailed guides on how to use Seismic directly in Rust, replicate the results of our paper, or use Seismic with your custom collection.
The source code in this repository is subject to the following citation license:
By downloading and using this software, you agree to cite the under-noted paper in any kind of material you produce where it was used to conduct a search or experimentation, whether be it a research paper, dissertation, article, poster, presentation, or documentation. By using this software, you have agreed to the citation license.
SIGIR 2024
@inproceedings{Seismic,
author = {Sebastian Bruch and Franco Maria Nardini and Cosimo Rulli and Rossano Venturini},
title = {Efficient Inverted Indexes for Approximate Retrieval over Learned Sparse Representations},
booktitle = {The 47th International {ACM} {SIGIR} {C}onference on Research and Development in Information Retrieval ({SIGIR})},
pages = {152--162},
publisher = {{ACM}},
year = {2024},
url = {https://doi.org/10.1145/3626772.3657769},
doi = {10.1145/3626772.3657769},
}
CIKM 2024
@inproceedings{bruch2024pairing,
title={Pairing Clustered Inverted Indexes with $\kappa$-NN Graphs for Fast Approximate Retrieval over Learned Sparse Representations},
author={Bruch, Sebastian and Nardini, Franco Maria and Rulli, Cosimo and Venturini, Rossano},
booktitle={Proceedings of the 33rd ACM International Conference on Information and Knowledge Management},
pages={3642--3646},
year={2024}
}