| Crates.io | tf-idf-vectorizer |
| lib.rs | tf-idf-vectorizer |
| version | 0.5.0 |
| created_at | 2025-02-08 17:59:46.822612+00 |
| updated_at | 2025-09-24 13:06:18.674115+00 |
| description | A simple search and analyze engine |
| homepage | |
| repository | https://github.com/371tti/tf-idf-vectorizer |
| max_upload_size | |
| id | 1548252 |
| size | 1,783,033 |
Ultra-flexible & high-speed document analysis engine in Rust
Supports everything from corpus construction → TF calculation → IDF calculation → TF-IDF vectorization / similarity search.
- Serializable intermediate (TFIDFData) for persistence
- Query and result types (SimilarityAlgorithm, Hits) for search

Cargo.toml
```toml
[dependencies]
tf-idf-vectorizer = "0.4.3" # This README is for v0.4.x
```
```rust
use std::sync::Arc;

use tf_idf_vectorizer::{Corpus, SimilarityAlgorithm, TFIDFVectorizer, TokenFrequency};

fn main() {
    // build corpus
    let corpus = Arc::new(Corpus::new());

    // count token frequencies for each document
    let mut freq1 = TokenFrequency::new();
    freq1.add_tokens(&["rust", "高速", "並列", "rust"]);
    let mut freq2 = TokenFrequency::new();
    freq2.add_tokens(&["rust", "柔軟", "安全", "rust"]);

    // build the vectorizer and add the documents
    let mut vectorizer: TFIDFVectorizer<u16> = TFIDFVectorizer::new(corpus);
    vectorizer.add_doc("doc1".to_string(), &freq1);
    vectorizer.add_doc("doc2".to_string(), &freq2);

    // similarity search
    let mut query_tokens = TokenFrequency::new();
    query_tokens.add_tokens(&["rust", "高速"]);
    let algorithm = SimilarityAlgorithm::CosineSimilarity;
    let mut result = vectorizer.similarity(&query_tokens, &algorithm);
    result.sort_by_score();

    // print result
    result.list.iter().for_each(|(k, s, l)| {
        println!("doc: {}, score: {}, length: {}", k, s, l);
    });

    // debug
    println!("result count: {}", result.list.len());
    println!("{:?}", vectorizer);
}
```
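To make the pipeline in the example above concrete, here is a minimal, standard-library-only sketch of what TF, IDF, TF-IDF and cosine similarity compute. It is not the crate's implementation: the function names are hypothetical, and the smoothed IDF formula `ln((1 + N) / (1 + df)) + 1` is an assumed variant chosen for illustration.

```rust
use std::collections::{HashMap, HashSet};

/// Term frequency: occurrences of each token divided by document length.
fn term_frequency(tokens: &[&str]) -> HashMap<String, f64> {
    let mut counts: HashMap<String, f64> = HashMap::new();
    for t in tokens {
        *counts.entry(t.to_string()).or_insert(0.0) += 1.0;
    }
    let total = tokens.len() as f64;
    if total > 0.0 {
        for v in counts.values_mut() {
            *v /= total;
        }
    }
    counts
}

/// Smoothed IDF (assumed variant): ln((1 + N) / (1 + df)) + 1.
fn inverse_document_frequency(docs: &[Vec<&str>]) -> HashMap<String, f64> {
    let n = docs.len() as f64;
    let mut df: HashMap<String, f64> = HashMap::new();
    for doc in docs {
        // Count each token at most once per document.
        let mut seen: HashSet<&str> = HashSet::new();
        for t in doc {
            if seen.insert(*t) {
                *df.entry(t.to_string()).or_insert(0.0) += 1.0;
            }
        }
    }
    df.into_iter()
        .map(|(t, d)| (t, ((1.0 + n) / (1.0 + d)).ln() + 1.0))
        .collect()
}

/// TF-IDF vector; tokens without an IDF entry are dropped.
fn tf_idf(tokens: &[&str], idf: &HashMap<String, f64>) -> HashMap<String, f64> {
    term_frequency(tokens)
        .into_iter()
        .filter_map(|(t, tf)| {
            let w = *idf.get(&t)?;
            Some((t, tf * w))
        })
        .collect()
}

/// Cosine similarity between two sparse vectors.
fn cosine(a: &HashMap<String, f64>, b: &HashMap<String, f64>) -> f64 {
    let dot: f64 = a.iter().filter_map(|(t, x)| b.get(t).map(|y| x * y)).sum();
    let na = a.values().map(|x| x * x).sum::<f64>().sqrt();
    let nb = b.values().map(|x| x * x).sum::<f64>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

fn main() {
    let docs = vec![
        vec!["rust", "高速", "並列", "rust"],
        vec!["rust", "柔軟", "安全", "rust"],
    ];
    let idf = inverse_document_frequency(&docs);
    let query = tf_idf(&["rust", "高速"], &idf);
    for (i, doc) in docs.iter().enumerate() {
        let score = cosine(&query, &tf_idf(doc, &idf));
        println!("doc{}: {:.3}", i + 1, score);
    }
}
```

With these two documents, the query shares the rarer token only with the first one, so the first document scores higher under this sketch.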
TFIDFVectorizer contains references and cannot be deserialized directly.
Serialize as TFIDFData, and restore with into_tf_idf_vectorizer(Arc<Corpus>).
You can use any corpus for restoration; if the index contains tokens not in the corpus, they are ignored.
```rust
// Fragment: assumes an enclosing function that returns a Result.
// Save
let dump = serde_json::to_string(&vectorizer)?;

// Restore
let data: TFIDFData = serde_json::from_str(&dump)?;
let restored = data.into_tf_idf_vectorizer(&corpus);
```
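The save/restore pattern described above (dump plain data, then re-attach a shared corpus and ignore tokens the corpus does not know) can be sketched without the crate or serde. Every type here (`Vocabulary`, `IndexData`, `Index`) is a hypothetical stand-in and not part of tf-idf-vectorizer.

```rust
use std::collections::{HashMap, HashSet};
use std::sync::Arc;

/// Hypothetical stand-in for a shared corpus: just a set of known tokens.
struct Vocabulary {
    tokens: HashSet<String>,
}

/// Plain owned data, the part that could be serialized as-is.
struct IndexData {
    docs: HashMap<String, HashMap<String, u32>>,
}

/// Live index that holds a shared handle to the vocabulary.
struct Index {
    vocab: Arc<Vocabulary>,
    docs: HashMap<String, HashMap<String, u32>>,
}

impl Index {
    /// Detach the owned data from the shared handle.
    fn to_data(&self) -> IndexData {
        IndexData { docs: self.docs.clone() }
    }

    fn vocab_size(&self) -> usize {
        self.vocab.tokens.len()
    }
}

impl IndexData {
    /// Re-attach to a vocabulary, dropping tokens it does not contain.
    fn into_index(self, vocab: Arc<Vocabulary>) -> Index {
        let docs: HashMap<String, HashMap<String, u32>> = self
            .docs
            .into_iter()
            .map(|(id, counts)| {
                let kept: HashMap<String, u32> = counts
                    .into_iter()
                    .filter(|(t, _)| vocab.tokens.contains(t))
                    .collect();
                (id, kept)
            })
            .collect();
        Index { vocab, docs }
    }
}

fn main() {
    let full = Arc::new(Vocabulary {
        tokens: vec!["rust".to_string(), "legacy".to_string()].into_iter().collect(),
    });
    let mut counts = HashMap::new();
    counts.insert("rust".to_string(), 2u32);
    counts.insert("legacy".to_string(), 1u32);
    let mut docs = HashMap::new();
    docs.insert("doc1".to_string(), counts);
    let index = Index { vocab: full, docs };

    // Restore against a smaller vocabulary: "legacy" is silently dropped.
    let small = Arc::new(Vocabulary {
        tokens: vec!["rust".to_string()].into_iter().collect(),
    });
    let restored = index.to_data().into_index(small);
    println!(
        "vocab size: {}, tokens kept in doc1: {}",
        restored.vocab_size(),
        restored.docs["doc1"].len()
    );
}
```

Keeping the serializable data separate from the shared handle is what lets the same dump be restored against a different corpus.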
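A similarity search takes the query token frequencies and a scoring method (SimilarityAlgorithm) and returns the results as Hits.

You can inject your own scoring function by replacing the implemented Compare trait / DefaultCompare.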
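As a rough illustration of what a pluggable scoring function looks like, the sketch below defines a hypothetical `Scorer` trait with two implementations and a generic `rank` function. The crate's real Compare trait and DefaultCompare may have a different shape; nothing here is taken from its API.

```rust
/// Hypothetical scoring trait; the crate's real Compare trait may differ.
trait Scorer {
    fn score(&self, query: &[f64], doc: &[f64]) -> f64;
}

struct Cosine;
struct DotProduct;

fn dot(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b.iter()).map(|(x, y)| x * y).sum()
}

impl Scorer for Cosine {
    fn score(&self, query: &[f64], doc: &[f64]) -> f64 {
        let norm = dot(query, query).sqrt() * dot(doc, doc).sqrt();
        if norm == 0.0 { 0.0 } else { dot(query, doc) / norm }
    }
}

impl Scorer for DotProduct {
    fn score(&self, query: &[f64], doc: &[f64]) -> f64 {
        dot(query, doc)
    }
}

/// Score every document with the given scorer, best match first.
fn rank<S: Scorer>(scorer: &S, query: &[f64], docs: &[(&str, Vec<f64>)]) -> Vec<(String, f64)> {
    let mut hits: Vec<(String, f64)> = docs
        .iter()
        .map(|(id, v)| (id.to_string(), scorer.score(query, v)))
        .collect();
    hits.sort_by(|a, b| b.1.total_cmp(&a.1));
    hits
}

fn main() {
    let query = vec![1.0, 0.0];
    let docs = vec![("a", vec![10.0, 10.0]), ("b", vec![1.0, 0.0])];
    // Cosine ignores vector length; the dot product rewards it.
    println!("cosine: {:?}", rank(&Cosine, &query, &docs));
    println!("dot:    {:?}", rank(&DotProduct, &query, &docs));
}
```

Swapping the scorer changes the ranking without touching the indexing code: cosine prefers the document pointing in the same direction as the query, while the dot product prefers the one with larger weights.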
- Caches the token dictionary (token_dim_sample / token_dim_set) to avoid rebuilding
- Combines TF and IDF with iterators (tf.zip(idf).map(...))

| Type | Role |
|---|---|
| Corpus | Document set meta / frequency getter |
| TokenFrequency | Token frequency in a single document |
| TFVector | Sparse TF vector for one document |
| IDFVector | Global IDF and meta |
| TFIDFVectorizer | TF/IDF management and search entry |
| TFIDFData | Intermediate for serialization |
| DefaultTFIDFEngine | TF/IDF calculation backend |
| SimilarityAlgorithm / Hits | Search query and results |

- Replace the scoring logic via the Compare trait
- Swap TFIDFEngine for different weighting schemes

Run the minimal example with:
```sh
cargo run --example basic
```
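For readers wondering what "different weighting schemes" could mean for a custom engine, the following standalone sketch compares two common TF weightings and two IDF weightings. The enums and functions are hypothetical and do not reflect the TFIDFEngine trait's actual methods.

```rust
/// Hypothetical TF weighting choices (not the crate's API).
#[derive(Clone, Copy)]
enum TfScheme {
    Raw,       // count / document length
    LogScaled, // 1 + ln(count)
}

/// Hypothetical IDF weighting choices (not the crate's API).
#[derive(Clone, Copy)]
enum IdfScheme {
    Plain,    // ln(N / df)
    Smoothed, // ln((1 + N) / (1 + df)) + 1
}

fn tf_weight(scheme: TfScheme, count: u32, doc_len: u32) -> f64 {
    if count == 0 || doc_len == 0 {
        return 0.0;
    }
    match scheme {
        TfScheme::Raw => count as f64 / doc_len as f64,
        TfScheme::LogScaled => 1.0 + (count as f64).ln(),
    }
}

fn idf_weight(scheme: IdfScheme, n_docs: u32, doc_freq: u32) -> f64 {
    let (n, df) = (n_docs as f64, doc_freq as f64);
    match scheme {
        IdfScheme::Plain => if df == 0.0 { 0.0 } else { (n / df).ln() },
        IdfScheme::Smoothed => ((1.0 + n) / (1.0 + df)).ln() + 1.0,
    }
}

fn main() {
    // A token that appears twice in a 4-token document and in 1 of 2 documents.
    let (count, doc_len, n_docs, doc_freq) = (2, 4, 2, 1);
    let raw_plain = tf_weight(TfScheme::Raw, count, doc_len)
        * idf_weight(IdfScheme::Plain, n_docs, doc_freq);
    let log_smooth = tf_weight(TfScheme::LogScaled, count, doc_len)
        * idf_weight(IdfScheme::Smoothed, n_docs, doc_freq);
    println!("raw TF x plain IDF:     {:.4}", raw_plain);
    println!("log TF x smoothed IDF:  {:.4}", log_smooth);
}
```

Under plain IDF a token that appears in every document gets weight zero, whereas the smoothed variant keeps it at one; which behavior is preferable depends on the corpus.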