Crates.io | seekstorm |
lib.rs | seekstorm |
version | 0.8.0 |
source | src |
created_at | 2024-03-17 06:49:54.218738 |
updated_at | 2024-10-28 09:57:26.926871 |
description | Search engine library & multi-tenancy server |
homepage | https://seekstorm.com |
repository | https://github.com/SeekStorm/SeekStorm |
id | 1176262 |
size | 1,127,088 |
SeekStorm is an open-source, sub-millisecond full-text search library & multi-tenancy server implemented in Rust.
Development started in 2015, in production since 2020, Rust port in 2023, open sourced in 2024, work in progress.
SeekStorm is open source, licensed under the Apache License 2.0.
Blog Posts: SeekStorm is now Open Source and SeekStorm gets Faceted search, Geo proximity search, Result sorting
Query types
Result types
Performance
Lower latency, higher throughput, lower cost & energy consumption, esp. for multi-field and concurrent queries.
Low tail latencies ensure a smooth user experience and prevent loss of customers and revenue.
While some rely on proprietary hardware accelerators (FPGA/ASIC) or clusters to improve performance,
SeekStorm achieves a similar boost algorithmically on a single commodity server.
Consistency
No unpredictable query latency during and after large-volume indexing as SeekStorm doesn't require resource-intensive segment merges.
Stable latencies - no cold start costs due to just-in-time compilation, no unpredictable garbage collection delays.
Scaling
Maintains low latency, high throughput, and low RAM consumption even for billion-scale indices.
Unlimited field number, field length & index size.
Relevance
Term proximity ranking provides more relevant results compared to BM25.
Real-time
True real-time search, as opposed to NRT: every indexed document is immediately searchable, even before and during commit.
Example query "the who": vanilla BM25 ranking vs. SeekStorm proximity ranking
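To make the idea behind the comparison above concrete, here is a toy sketch of a proximity signal. It is not SeekStorm's ranking algorithm; the function names `positions`, `min_distance` and `proximity_bonus` are hypothetical, and tokenization is plain whitespace splitting. It only illustrates why a document containing the adjacent phrase "the who" can be preferred over one where both words merely occur somewhere.

```rust
/// Token positions of `term` in `text` (whitespace tokenization, case-insensitive).
fn positions(text: &str, term: &str) -> Vec<usize> {
    let term = term.to_lowercase();
    text.split_whitespace()
        .enumerate()
        .filter(|(_, tok)| tok.to_lowercase() == term)
        .map(|(i, _)| i)
        .collect()
}

/// Smallest distance in tokens between any occurrence of `a` and any occurrence of `b`.
/// Returns None if either term is missing from the text.
fn min_distance(text: &str, a: &str, b: &str) -> Option<usize> {
    let pa = positions(text, a);
    let pb = positions(text, b);
    let mut best: Option<usize> = None;
    for &x in &pa {
        for &y in &pb {
            let d = x.abs_diff(y);
            best = Some(best.map_or(d, |m| m.min(d)));
        }
    }
    best
}

/// Hypothetical bonus that grows as the two terms move closer together.
/// A real engine would combine such a signal with a lexical score like BM25.
fn proximity_bonus(text: &str, a: &str, b: &str) -> f64 {
    match min_distance(text, a, b) {
        Some(d) => 1.0 / (d.max(1) as f64),
        None => 0.0,
    }
}
```

With this sketch, "The Who are an English rock band" gets a larger bonus for the terms "the" and "who" than "the man who sold the world", although a frequency-only score sees both terms in both documents.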
Methodology
Comparing different open-source search engine libraries (BM25 lexical search) using the open-source search_benchmark_game developed by Tantivy and Jason Wolfe.
Benefits
Detailed benchmark results https://seekstorm.github.io/search-benchmark-game/
Benchmark code repository https://github.com/SeekStorm/search-benchmark-game/
See our blog posts for more detailed information: SeekStorm is now Open Source and SeekStorm gets Faceted search, Geo proximity search, Result sorting
Despite what the hype-cycles https://www.bitecode.dev/p/hype-cycles want you to believe, keyword search is not dead, just as NoSQL wasn't the death of SQL.
You should maintain a toolbox, and choose the best tool for your task at hand. https://seekstorm.com/blog/vector-search-vs-keyword-search1/
Keyword search is just a filter for a set of documents, returning those in which certain keywords occur, usually combined with a ranking metric like BM25. It is a very basic and core functionality that is very challenging to implement at scale with low latency. Because the functionality is so basic, there is an unlimited number of application fields. It is a component, to be used together with other components. There are use cases which can be solved better today with vector search and LLMs, but for many more keyword search is still the best solution. Keyword search is exact, lossless, and very fast, with better scaling, better latency, and lower cost and energy consumption. Vector search works with semantic similarity, returning results within a given proximity and probability.
If you search for exact results like proper names, numbers, license plates, domain names, and phrases (e.g. plagiarism detection) then keyword search is your friend. Vector search, on the other hand, will bury the exact result that you are looking for among a myriad of results that are only loosely semantically related. At the same time, if you don’t know the exact terms, or you are interested in a broader topic, meaning or synonym, no matter what exact terms are used, then keyword search will fail you.
- works with text data only
- unable to capture context, meaning and semantic similarity
- low recall for semantic meaning
+ perfect recall for exact keyword match
+ perfect precision (for exact keyword match)
+ high query speed and throughput (for large document numbers)
+ high indexing speed (for large document numbers)
+ incremental indexing fully supported
+ smaller index size
+ lower infrastructure cost per document and per query, lower energy consumption
+ good scalability (for large document numbers)
+ perfect for exact keyword and phrase search, no false positives
+ perfect explainability
+ efficient and lossless for exact keyword and phrase search
+ works with new vocabulary out of the box
+ works with any language out of the box
+ works perfectly with long-tail vocabulary out of the box
+ works perfectly with any rare language or domain-specific vocabulary out of the box
+ RAG (Retrieval-augmented generation) based on keyword search offers unrestricted real-time capabilities.
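The description of keyword search as an exact filter over a set of documents can be sketched with a tiny inverted index. This is a minimal illustration using only the standard library, not SeekStorm's implementation; `build_index`, `intersection_query` and `union_query` are hypothetical names, loosely mirroring the Intersection and Union query types used in the examples further down.

```rust
use std::collections::{BTreeSet, HashMap};

/// Map every lowercase token to the sorted set of document ids that contain it.
fn build_index(docs: &[&str]) -> HashMap<String, BTreeSet<usize>> {
    let mut index: HashMap<String, BTreeSet<usize>> = HashMap::new();
    for (doc_id, text) in docs.iter().enumerate() {
        for token in text.split_whitespace() {
            index.entry(token.to_lowercase()).or_default().insert(doc_id);
        }
    }
    index
}

/// Documents that contain ALL query terms (AND semantics).
fn intersection_query(index: &HashMap<String, BTreeSet<usize>>, terms: &[&str]) -> Vec<usize> {
    let mut lists = terms
        .iter()
        .map(|t| index.get(&t.to_lowercase()).cloned().unwrap_or_default());
    let first = match lists.next() {
        Some(list) => list,
        None => return Vec::new(),
    };
    lists
        .fold(first, |acc, list| acc.intersection(&list).copied().collect())
        .into_iter()
        .collect()
}

/// Documents that contain ANY query term (OR semantics).
fn union_query(index: &HashMap<String, BTreeSet<usize>>, terms: &[&str]) -> Vec<usize> {
    let mut out = BTreeSet::new();
    for t in terms {
        if let Some(list) = index.get(&t.to_lowercase()) {
            out.extend(list.iter().copied());
        }
    }
    out.into_iter().collect()
}
```

Every returned document id provably contains the queried terms, which is where the perfect precision and explainability listed above come from. Ranking (e.g. BM25) would be applied on top of the filtered set.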
Vector search is perfect if you don’t know the exact query terms, or you are interested in a broader topic, meaning or synonym, no matter what exact query terms are used. But if you are looking for exact terms, e.g. proper names, numbers, license plates, domain names, and phrases (e.g. plagiarism detection) then you should always use keyword search. Vector search will bury the exact result that you are looking for among a myriad of results that are only loosely related. It has good recall, but low precision and higher latency. It is prone to false positives, e.g. in plagiarism detection, as exact words and word order get lost.
Vector search enables you to search not only for similar text, but for everything that can be transformed into a vector: text, images (face recognition, fingerprints), and audio. It also enables you to do magic things like queen - woman + man = king.
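The "queen - woman + man = king" arithmetic can be reproduced with hand-made toy vectors. The two dimensions below (royalty, gender) and all values are invented for illustration and are not real embeddings; `cosine`, `analogy`, `nearest` and `toy_vocab` are hypothetical helper names.

```rust
/// Cosine similarity of two equally sized vectors (0.0 if either has zero length).
fn cosine(a: &[f64], b: &[f64]) -> f64 {
    let dot: f64 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f64>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f64>().sqrt();
    if na == 0.0 || nb == 0.0 {
        0.0
    } else {
        dot / (na * nb)
    }
}

/// Component-wise a - b + c.
fn analogy(a: &[f64], b: &[f64], c: &[f64]) -> Vec<f64> {
    a.iter().zip(b).zip(c).map(|((x, y), z)| x - y + z).collect()
}

/// Word whose vector has the highest cosine similarity to `query`.
fn nearest<'a>(vocab: &[(&'a str, Vec<f64>)], query: &[f64]) -> Option<&'a str> {
    vocab
        .iter()
        .max_by(|x, y| cosine(&x.1, query).total_cmp(&cosine(&y.1, query)))
        .map(|(word, _)| *word)
}

/// Toy vocabulary: [royalty, gender] with male = +1 and female = -1 (assumed values).
fn toy_vocab() -> Vec<(&'static str, Vec<f64>)> {
    vec![
        ("king", vec![1.0, 1.0]),
        ("queen", vec![1.0, -1.0]),
        ("man", vec![0.0, 1.0]),
        ("woman", vec![0.0, -1.0]),
    ]
}
```

Calling `analogy` with the vectors for queen, woman and man yields `[1.0, 1.0]`, and `nearest` resolves that to "king". Note that the answer is the closest vector, not an exact match, which is the source of both the semantic recall and the false positives discussed here.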
+ works with any data that can be transformed to a vector: text, image, audio ...
+ able to capture context, meaning, and semantic similarity
+ high recall for semantic meaning (90%)
- lower recall for exact keyword match (for Approximate Similarity Search)
- lower precision (for exact keyword match)
- lower query speed and throughput (for large document numbers)
- lower indexing speed (for large document numbers)
- incremental indexing is expensive and requires rebuilding the entire index periodically, which is extremely time-consuming and resource intensive.
- larger index size
- higher infrastructure cost per document and per query, higher energy consumption
- limited scalability (for large document numbers)
- unsuitable for exact keyword and phrase search, many false positives
- low explainability makes it difficult to spot manipulations, bias and root cause of retrieval/ranking problems
- inefficient and lossy for exact keyword and phrase search
- Additional effort and cost to create embeddings and keep them updated for every language and domain. Even if the number of indexed documents is small, the embeddings nevertheless have to be created from a large corpus beforehand.
- Limited real-time capability due to limited recency of embeddings
- works only with vocabulary known at the time of embedding creation
- works only with the languages of the corpus from which the embeddings have been derived
- works only with long-tail vocabulary that was sufficiently represented in the corpus from which the embeddings have been derived
- works only with rare language or domain-specific vocabulary that was sufficiently represented in the corpus from which the embeddings have been derived
- RAG (Retrieval-augmented generation) based on vector search offers only limited real-time capabilities, as it can't process new vocabulary that arrived after the embedding generation
Vector search is not a replacement for keyword search, but a complementary addition - best to be used within a hybrid solution where the strengths of both approaches are combined. Keyword search is not outdated, but time-proven.
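One common way to combine both result lists in a hybrid setup is reciprocal rank fusion (RRF). The sketch below is a generic illustration of that technique and is not something SeekStorm is stated to implement; the function name and the constant `k` (often chosen as 60) are assumptions. Each input is a list of document ids ordered best first.

```rust
use std::collections::HashMap;

/// Reciprocal rank fusion: every ranking contributes 1 / (k + rank) per document,
/// with rank starting at 1. Returns (doc_id, fused_score) sorted by score, best first.
fn reciprocal_rank_fusion(rankings: &[Vec<u64>], k: f64) -> Vec<(u64, f64)> {
    let mut scores: HashMap<u64, f64> = HashMap::new();
    for ranking in rankings {
        for (position, doc_id) in ranking.iter().enumerate() {
            let rank = position as f64 + 1.0;
            *scores.entry(*doc_id).or_insert(0.0) += 1.0 / (k + rank);
        }
    }
    let mut fused: Vec<(u64, f64)> = scores.into_iter().collect();
    // Highest score first; ties are broken by document id for a stable order.
    fused.sort_by(|a, b| b.1.total_cmp(&a.1).then(a.0.cmp(&b.0)));
    fused
}
```

Given a keyword ranking `[1, 2, 3]` and a vector ranking `[3, 1, 4]`, documents 1 and 3 rise to the top because both retrievers agree on them, while documents found by only one retriever are kept but ranked lower.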
We have (partially) ported the SeekStorm codebase from C# to Rust
Rust is great for performance-critical applications 🚀 that deal with big data and/or many concurrent users. Fast algorithms will shine even more with a performance-conscious programming language 🙂
see ARCHITECTURE.md
cargo build --release
⚠ WARNING: make sure to set the MASTER_KEY_SECRET environment variable to a secret, otherwise your generated API keys will be compromised.
Build documentation
cargo doc --no-deps
Access documentation locally
SeekStorm\target\doc\seekstorm\index.html
SeekStorm\target\doc\seekstorm_server\index.html
Add required crates to your project
cargo add seekstorm
cargo add tokio
cargo add serde_json
use std::{collections::HashSet, error::Error, path::Path, sync::Arc};
use seekstorm::{index::*,search::*,highlighter::*,commit::Commit};
use tokio::sync::{RwLock, Semaphore};
use an asynchronous Rust runtime
#[tokio::main]
async fn main() -> Result<(), Box<dyn Error + Send + Sync>> {
create index
let index_path=Path::new("C:/index/");
let schema_json = r#"
[{"field":"title","field_type":"Text","stored":false,"indexed":false},
{"field":"body","field_type":"Text","stored":true,"indexed":true},
{"field":"url","field_type":"Text","stored":false,"indexed":false}]"#;
let schema=serde_json::from_str(schema_json).unwrap();
let meta = IndexMetaObject {
id: 0,
name: "test_index".to_string(),
similarity:SimilarityType::Bm25f,
tokenizer:TokenizerType::AsciiAlphabetic,
access_type: AccessType::Mmap,
};
let serialize_schema=true;
let segment_number_bits1=11;
let index=create_index(index_path,meta,&schema,serialize_schema,segment_number_bits1,false).unwrap();
let _index_arc = Arc::new(RwLock::new(index));
open index (alternatively to create index)
let index_path=Path::new("C:/index/");
let mut index_arc=open_index(index_path,false).await.unwrap();
index documents
let documents_json = r#"
[{"title":"title1 test","body":"body1","url":"url1"},
{"title":"title2","body":"body2 test","url":"url2"},
{"title":"title3 test","body":"body3 test","url":"url3"}]"#;
let documents_vec=serde_json::from_str(documents_json).unwrap();
index_arc.index_documents(documents_vec).await;
commit documents
index_arc.commit().await;
search index
let query="test".to_string();
let offset=0;
let length=10;
let query_type=QueryType::Intersection;
let result_type=ResultType::TopkCount;
let include_uncommitted=false;
let field_filter=Vec::new();
let result_object = index_arc.search(query, query_type, offset, length, result_type,include_uncommitted,field_filter).await;
display results
let highlights:Vec<Highlight>= vec![
Highlight {
field: "body".to_string(),
name:String::new(),
fragment_number: 2,
fragment_size: 160,
highlight_markup: true,
},
];
let highlighter=Some(highlighter(highlights, result_object.query_term_strings));
let return_fields_filter= HashSet::new();
let mut index=index_arc.write().await;
for result in result_object.results.iter() {
let doc=index.get_document(result.doc_id,false,&highlighter,&return_fields_filter).unwrap();
println!("result {} rank {} body field {:?}" , result.doc_id,result.score, doc.get("body"));
}
multi-threaded search
let query_vec=vec!["house".to_string(),"car".to_string(),"bird".to_string(),"sky".to_string()];
let offset=0;
let length=10;
let query_type=QueryType::Union;
let result_type=ResultType::TopkCount;
let thread_number = 4;
let permits = Arc::new(Semaphore::new(thread_number));
for query in query_vec {
let permit_thread = permits.clone().acquire_owned().await.unwrap();
let query_clone = query.clone();
let index_arc_clone = index_arc.clone();
let query_type_clone = query_type.clone();
let result_type_clone = result_type.clone();
let offset_clone = offset;
let length_clone = length;
tokio::spawn(async move {
let rlo = index_arc_clone
.search(
query_clone,
query_type_clone,
offset_clone,
length_clone,
result_type_clone,
false,
Vec::new(),
)
.await;
println!("result count {}", rlo.result_count);
drop(permit_thread);
});
}
clear index
index.clear_index();
delete index
index.delete_index();
close index
index.close_index();
seekstorm library version string
let version=version();
println!("version {}",version);
Facets are defined in 3 different places:
A minimal working example of faceted indexing & search requires just 60 lines of code. But piecing it all together from the documentation alone might be tedious. This is why we provide a quick start example here:
Add required crates to your project
cargo add seekstorm
cargo add tokio
cargo add serde_json
Add use declarations
use std::{collections::HashSet, error::Error, path::Path, sync::Arc};
use seekstorm::{index::*,search::*,highlighter::*,commit::Commit};
use tokio::sync::RwLock;
use an asynchronous Rust runtime
#[tokio::main]
async fn main() -> Result<(), Box<dyn Error + Send + Sync>> {
create index
let index_path=Path::new("C:/index/");
let schema_json = r#"
[{"field":"title","field_type":"Text","stored":false,"indexed":false},
{"field":"body","field_type":"Text","stored":true,"indexed":true},
{"field":"url","field_type":"Text","stored":true,"indexed":false},
{"field":"town","field_type":"String","stored":false,"indexed":false,"facet":true}]"#;
let schema=serde_json::from_str(schema_json).unwrap();
let meta = IndexMetaObject {
id: 0,
name: "test_index".to_string(),
similarity:SimilarityType::Bm25f,
tokenizer:TokenizerType::AsciiAlphabetic,
access_type: AccessType::Mmap,
};
let serialize_schema=true;
let segment_number_bits1=11;
let index=create_index(index_path,meta,&schema,serialize_schema,segment_number_bits1,false).unwrap();
let mut index_arc = Arc::new(RwLock::new(index));
index documents
let documents_json = r#"
[{"title":"title1 test","body":"body1","url":"url1","town":"Berlin"},
{"title":"title2","body":"body2 test","url":"url2","town":"Warsaw"},
{"title":"title3 test","body":"body3 test","url":"url3","town":"New York"}]"#;
let documents_vec=serde_json::from_str(documents_json).unwrap();
index_arc.index_documents(documents_vec).await;
commit documents
index_arc.commit().await;
search index
let query="test".to_string();
let offset=0;
let length=10;
let query_type=QueryType::Intersection;
let result_type=ResultType::TopkCount;
let include_uncommitted=false;
let field_filter=Vec::new();
let query_facets = vec![QueryFacet::String {field: "town".to_string(),prefix: "".to_string(),length:u16::MAX}];
let facet_filter=Vec::new();
//let facet_filter = vec![FacetFilter::String { field: "town".to_string(),filter: vec!["Berlin".to_string()],}];
let facet_result_sort=Vec::new();
let result_object = index_arc.search(query, query_type, offset, length, result_type,include_uncommitted,field_filter,query_facets,facet_filter).await;
display results
let highlights:Vec<Highlight>= vec![
Highlight {
field: "body".to_owned(),
name:String::new(),
fragment_number: 2,
fragment_size: 160,
highlight_markup: true,
},
];
let highlighter2=Some(highlighter(highlights, result_object.query_terms));
let return_fields_filter= HashSet::new();
let index=index_arc.write().await;
for result in result_object.results.iter() {
let doc=index.get_document(result.doc_id,false,&highlighter2,&return_fields_filter).unwrap();
println!("result {} rank {} body field {:?}" , result.doc_id,result.score, doc.get("body"));
}
display facets
println!("{}", serde_json::to_string_pretty(&result_object.facets).unwrap());
end of main function
Ok(())
}
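What the facet output of the example above represents can be sketched independently of the library: for every distinct value of a facet field, count how many matching documents carry it. The helper below is a hypothetical illustration, not SeekStorm's facet implementation. It takes the "town" value of each matching document and returns the counts sorted by frequency.

```rust
use std::collections::BTreeMap;

/// Count occurrences of each facet value among the matching documents.
/// Result is sorted by count (descending), ties in alphabetical order.
fn facet_counts<'a>(values: &[&'a str]) -> Vec<(&'a str, usize)> {
    let mut counts: BTreeMap<&'a str, usize> = BTreeMap::new();
    for &value in values {
        *counts.entry(value).or_insert(0) += 1;
    }
    // BTreeMap iterates alphabetically; the stable sort keeps that order for equal counts.
    let mut out: Vec<(&'a str, usize)> = counts.into_iter().collect();
    out.sort_by(|a, b| b.1.cmp(&a.1));
    out
}
```

For the towns `["Berlin", "Warsaw", "Berlin", "New York"]` this yields Berlin with a count of 2, followed by New York and Warsaw with 1 each. A facet filter then simply restricts the result set to documents with one of the selected values.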
A quick step-by-step tutorial on how to build a Wikipedia search engine from a Wikipedia corpus using the SeekStorm server in 5 easy steps.
Download SeekStorm
Download SeekStorm from the GitHub repository
Unzip in a directory of your choice, open in Visual Studio Code.
or alternatively
git clone https://github.com/SeekStorm/SeekStorm.git
Build SeekStorm
Install Rust (if not yet present): https://www.rust-lang.org/tools/install
In the terminal of Visual Studio Code type:
cargo build --release
Get Wikipedia corpus
Preprocessed English Wikipedia corpus (5,032,105 documents, 8.28 GB decompressed). Although wiki-articles.json has a .json extension, it is not a valid JSON file. It is a text file, where every line contains a JSON object with url, title and body attributes. The format is called ndjson ("Newline delimited JSON").
Decompress the Wikipedia corpus.
bzip2 for Windows, if not yet installed: https://gnuwin32.sourceforge.net/packages/bzip2.htm
bunzip2 wiki-articles.json.bz2
Move the decompressed wiki-articles.json to the release directory
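Because the corpus is ndjson rather than a single JSON document, it is best read line by line instead of being parsed as a whole. The following is a minimal standard-library sketch of that access pattern; `for_each_record` is a hypothetical helper, and each record is handed over as a raw string that a JSON parser such as serde_json could then decode.

```rust
use std::io::{self, BufRead};

/// Stream an ndjson source line by line, calling `handle` once per non-empty record.
/// Returns the number of records seen. Only one line is held in memory at a time,
/// which matters for a multi-gigabyte file.
fn for_each_record<R: BufRead>(reader: R, mut handle: impl FnMut(&str)) -> io::Result<usize> {
    let mut count = 0;
    for line in reader.lines() {
        let line = line?;
        let record = line.trim();
        if record.is_empty() {
            continue; // tolerate blank lines between records
        }
        handle(record);
        count += 1;
    }
    Ok(count)
}
```

For a real file, the reader would typically be `std::io::BufReader::new(std::fs::File::open("wiki-articles.json")?)`; for a quick check, any `&[u8]` works as well, since byte slices implement `BufRead`.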
Start SeekStorm server
cd target/release
./seekstorm_server local_ip="0.0.0.0" local_port=80
Indexing
Type 'ingest' into the command line of the running SeekStorm server:
ingest
This creates the demo index and indexes the local Wikipedia file.
Start searching within the embedded WebUI
Open embedded Web UI in browser: http://127.0.0.1
Enter a query into the search box
Testing the REST API endpoints
Open src/seekstorm_server/test_api.rest in VSC together with the VSC extension "REST Client" to execute API calls and inspect responses
interactive API endpoint examples
Set the 'individual API key' in test_api.rest to the API key displayed in the server console when you typed 'ingest' above.
Remove demo index
Type 'delete' into the command line of the running SeekStorm server:
delete
Shutdown server
Type 'quit' into the command line of the running SeekStorm server.
quit
Customizing
Do you want to use something similar for your own project? Have a look at the ingest and web UI documentation.
Full-text search 30M Hacker News posts AND linked web pages
The DeepHN demo is still based on the SeekStorm C# codebase.
We are currently porting all required missing features.
See roadmap below.
The Rust port is not yet feature complete. The following features are currently ported.
Porting
Improvements
New features
Native vector search (currently PoC)
Distributed search cluster (currently PoC)