| Crates.io | bm25-vectorizer |
| lib.rs | bm25-vectorizer |
| version | 1.0.0 |
| created_at | 2025-10-06 19:07:10.434484+00 |
| updated_at | 2025-10-06 19:07:10.434484+00 |
| description | A minimal Rust library for creating sparse vector representations (embeddings) using the BM25 algorithm for information retrieval. |
| repository | https://github.com/ep9io/bm25-vectorizer |
| id | 1870661 |
| size | 84,760 |
A minimal Rust library for creating sparse vector representations using the BM25 algorithm. These vectors can be efficiently stored in vector databases like Qdrant for keyword-based information retrieval.
BM25 is a probabilistic ranking algorithm that calculates relevance scores between queries and documents based on term frequency and inverse document frequency. This library's implementation produces only the normalised term frequency (TF) component in document vectors and expects the inverse document frequency (IDF) to be computed by the vector database. This approach allows IDF to automatically update as documents are added or removed without re-encoding existing documents.
NOTE: Some vector databases require an IDF modifier to be specified when the vector store is set up, so that they calculate IDF statistics automatically.
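To make the TF/IDF split concrete, here is a minimal sketch of the per-term value a document vector could carry. It assumes the common BM25 saturation form with the usual `k1` and `b` parameters, plus the BM25+ `delta` added to every matching term. The function name `bm25_tf` and its signature are hypothetical and are not taken from this crate's API.

```rust
/// Term-frequency component of BM25 for one term in one document.
///
/// Hypothetical helper for illustration only, not the crate's API.
/// Assumes `avg_doc_len > 0`. IDF is deliberately left out, because
/// the vector database is expected to apply it at query time.
pub fn bm25_tf(
    term_freq: f32,   // occurrences of the term in the document
    doc_len: f32,     // number of tokens in the document
    avg_doc_len: f32, // average number of tokens per document
    k1: f32,          // saturation parameter (1.2 is a common choice)
    b: f32,           // length-normalisation strength (0.75 is common)
    delta: f32,       // BM25+ lower bound for matching terms (0.0 disables it)
) -> f32 {
    // Terms that do not occur contribute nothing, not even delta.
    if term_freq <= 0.0 {
        return 0.0;
    }
    // Longer-than-average documents are penalised, shorter ones boosted.
    let length_norm = 1.0 - b + b * (doc_len / avg_doc_len);
    // Saturating term frequency, plus the BM25+ floor.
    term_freq * (k1 + 1.0) / (term_freq + k1 * length_norm) + delta
}
```

With `k1 = 1.2`, `b = 0.75` and a document of exactly average length, a single occurrence yields 1.0, and the value approaches but never reaches `k1 + 1 = 2.2` as the count grows. Setting `delta` to 0.5 lifts the single-occurrence value to 1.5.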
This library was created to address the following gaps in existing Rust solutions (as of Sep 2025):
- A minimal-dependency library for generating BM25 embeddings that can be loaded into vector databases. Only the thiserror crate is required (the rayon crate is optional, for parallelism).
- Separation of concerns. Tokenisation and indexing are decoupled, so the dependent library or binary can choose its own hashing strategy (e.g. Murmur3 or a dictionary) and its own tokeniser.
- No duplicate indices/values. The final embedding vector contains only unique indices.
- Support for the BM25+ delta (δ) parameter, which ensures a minimum contribution from every matching term.
- Reproducible indices/values. This implementation avoids HashMap to guarantee deterministic results (e.g. for downstream unit tests).
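The points above about decoupling, unique indices and determinism can be sketched as follows. This is an illustration under stated assumptions, not the crate's actual API: the trait names `Tokenizer` and `TokenIndexer`, the two implementations and `term_counts` are all hypothetical. The indexer uses FNV-1a rather than Murmur3, purely to keep the sketch free of dependencies.

```rust
use std::collections::BTreeMap;

/// Hypothetical trait: splits text into tokens.
pub trait Tokenizer {
    fn tokenize(&self, text: &str) -> Vec<String>;
}

/// Hypothetical trait: maps a token to a sparse-vector index.
pub trait TokenIndexer {
    fn index(&self, token: &str) -> u32;
}

/// Lowercases and splits on whitespace.
pub struct WhitespaceTokenizer;

impl Tokenizer for WhitespaceTokenizer {
    fn tokenize(&self, text: &str) -> Vec<String> {
        text.split_whitespace().map(|t| t.to_lowercase()).collect()
    }
}

/// 32-bit FNV-1a hash, swapped in for Murmur3 to avoid a dependency.
pub struct Fnv1aIndexer;

impl TokenIndexer for Fnv1aIndexer {
    fn index(&self, token: &str) -> u32 {
        let mut hash: u32 = 0x811c_9dc5; // FNV offset basis
        for byte in token.bytes() {
            hash ^= u32::from(byte);
            hash = hash.wrapping_mul(0x0100_0193); // FNV prime
        }
        hash
    }
}

/// Raw term counts as parallel (indices, values) vectors.
///
/// A BTreeMap keeps one entry per index and iterates in ascending key
/// order, so the output has unique indices and is identical on every
/// run. Tokens whose hashes collide are merged into a single entry.
pub fn term_counts<T: Tokenizer, I: TokenIndexer>(
    tokenizer: &T,
    indexer: &I,
    text: &str,
) -> (Vec<u32>, Vec<f32>) {
    let mut counts: BTreeMap<u32, f32> = BTreeMap::new();
    for token in tokenizer.tokenize(text) {
        *counts.entry(indexer.index(&token)).or_insert(0.0) += 1.0;
    }
    counts.into_iter().unzip()
}
```

Calling `term_counts(&WhitespaceTokenizer, &Fnv1aIndexer, "the cat saw the other cat")` returns indices in strictly ascending order, with counts that sum to the six input tokens. Either trait implementation can be replaced without touching the other, which is the kind of decoupling the list describes.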
The file example.rs provides an example of implementing a Murmur3 indexer and a tokeniser that performs the following steps: