| crate | tantivy-analysis-contrib (crates.io / lib.rs) |
| version | 0.12.4 |
| source | src |
| created_at | 2022-03-22 22:45:18.414746 |
| updated_at | 2024-11-07 19:15:30.251701 |
| description | A set of analysis components for Tantivy |
| homepage | https://github.com/Dalvany/tantivy-analysis-contrib |
| repository | https://github.com/Dalvany/tantivy-analysis-contrib |
| id | 554927 |
| size | 601,330 |
This is a collection of `Tokenizer` and `TokenFilter` implementations for Tantivy that aims to replicate features available in Lucene.

It relies on Google's Rust ICU bindings; libicu-dev and clang need to be installed in order to compile. The word-break rules come from Lucene.
The `icu` feature includes the following components (each is also a feature of its own; a short sketch follows the list):

- `ICUTokenizer`
- `ICUNormalizer2TokenFilter`
- `ICUTransformTokenFilter`
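For instance, here is a minimal sketch that runs `ICUTokenizer` on its own (no filter) through Tantivy's `TextAnalyzer`; the sample input string is arbitrary and the exact tokens produced depend on the ICU word-break rules:

```rust
use tantivy::tokenizer::TextAnalyzer;
use tantivy_analysis_contrib::icu::ICUTokenizer;

fn main() {
    // Analyzer made of the ICU tokenizer alone: Unicode-aware word
    // segmentation, no token filter applied.
    let mut analyzer = TextAnalyzer::builder(ICUTokenizer).build();

    // Collect the text of every token produced for a mixed-script input.
    let mut tokens: Vec<String> = Vec::new();
    analyzer
        .token_stream("Tokenize 中国 with ICU word-break rules")
        .process(&mut |token| tokens.push(token.text.clone()));

    println!("{tokens:?}");
}
```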
The `commons` feature includes the following components:

- `LengthTokenFilter`
- `LimitTokenCountFilter`
- `PathTokenizer`
- `ReverseTokenFilter`
- `ElisionTokenFilter`
- `EdgeNgramTokenFilter`
The `phonetic` feature includes some phonetic algorithms (Beider-Morse, Soundex, Metaphone, ...; see the crate documentation):

- `PhoneticTokenFilter`

There is also an `embedded` feature, which enables the embedded rules of the rphonetic crate. This feature is not included by default. It has two sub-features: `embedded-bm`, which enables only the embedded Beider-Morse rules, and `embedded-dm`, which enables only the Daitch-Mokotoff rules. Note that phonetic support probably needs improvements.
By default, `icu`, `commons`, and `phonetic` are included.
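As an illustration, a `Cargo.toml` dependency entry could look like the sketch below; the feature names come from the lists above and the version from the crate metadata, but check the crate documentation for the exact feature combination you need:

```toml
[dependencies]
# Default features: icu, commons and phonetic.
tantivy-analysis-contrib = "0.12"

# Or pick features explicitly, e.g. the commons components plus phonetic
# support with the embedded Beider-Morse rules:
# tantivy-analysis-contrib = { version = "0.12", default-features = false, features = ["commons", "phonetic", "embedded-bm"] }
```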
use tantivy::collector::TopDocs;
use tantivy::query::QueryParser;
use tantivy::schema::{IndexRecordOption, SchemaBuilder, TextFieldIndexing, TextOptions, Value};
use tantivy::tokenizer::TextAnalyzer;
use tantivy::{doc, Index, ReloadPolicy, TantivyDocument};
use tantivy_analysis_contrib::icu::{Direction, ICUTokenizer, ICUTransformTokenFilter};

const ANALYSIS_NAME: &str = "test";

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Index the field with the custom analyzer and store it so it can be retrieved.
    let options = TextOptions::default()
        .set_indexing_options(
            TextFieldIndexing::default()
                .set_tokenizer(ANALYSIS_NAME)
                .set_index_option(IndexRecordOption::WithFreqsAndPositions),
        )
        .set_stored();
    let mut schema = SchemaBuilder::new();
    schema.add_text_field("field", options);
    let schema = schema.build();

    // ICU transform: transliterate to Latin, strip diacritics, lowercase.
    let transform = ICUTransformTokenFilter::new(
        "Any-Latin; NFD; [:Nonspacing Mark:] Remove; Lower; NFC".to_string(),
        None,
        Direction::Forward,
    )?;
    let icu_analyzer = TextAnalyzer::builder(ICUTokenizer)
        .filter(transform)
        .build();

    let field = schema.get_field("field").expect("Can't get field.");
    let index = Index::create_in_ram(schema);
    // Register the analyzer under the name referenced by the field's indexing options.
    index.tokenizers().register(ANALYSIS_NAME, icu_analyzer);

    let mut index_writer = index.writer(15_000_000)?;
    index_writer.add_document(doc!(
        field => "中国"
    ))?;
    index_writer.add_document(doc!(
        field => "Another Document"
    ))?;
    index_writer.commit()?;

    let reader = index
        .reader_builder()
        .reload_policy(ReloadPolicy::Manual)
        .try_into()?;
    let searcher = reader.searcher();
    let query_parser = QueryParser::for_index(&index, vec![field]);

    // The Han text was transliterated at index time, so a Latin query matches it.
    let query = query_parser.parse_query("zhong")?;
    let top_docs = searcher.search(&query, &TopDocs::with_limit(10))?;
    let mut result: Vec<String> = Vec::new();
    for (_, doc_address) in top_docs {
        let retrieved_doc = searcher.doc::<TantivyDocument>(doc_address)?;
        result = retrieved_doc
            .get_all(field)
            .map(|v| v.as_str().unwrap().to_string())
            .collect();
    }
    let expected: Vec<String> = vec!["中国".to_string()];
    assert_eq!(expected, result);

    // Querying with the original characters also matches, because the query
    // string goes through the same analysis chain.
    let query = query_parser.parse_query("国")?;
    let top_docs = searcher.search(&query, &TopDocs::with_limit(10))?;
    let mut result: Vec<String> = Vec::new();
    for (_, doc_address) in top_docs {
        let retrieved_doc = searcher.doc::<TantivyDocument>(doc_address)?;
        result = retrieved_doc
            .get_all(field)
            .map(|v| v.as_str().unwrap().to_string())
            .collect();
    }
    let expected: Vec<String> = vec!["中国".to_string()];
    assert_eq!(expected, result);

    // The transform also lowercases, so "document" matches "Another Document".
    let query = query_parser.parse_query("document")?;
    let top_docs = searcher.search(&query, &TopDocs::with_limit(10))?;
    let mut result: Vec<String> = Vec::new();
    for (_, doc_address) in top_docs {
        let retrieved_doc = searcher.doc::<TantivyDocument>(doc_address)?;
        result = retrieved_doc
            .get_all(field)
            .map(|v| v.as_str().unwrap().to_string())
            .collect();
    }
    let expected: Vec<String> = vec!["Another Document".to_string()];
    assert_eq!(expected, result);

    Ok(())
}
Licensed under either of

- Apache License, Version 2.0
- MIT license

at your option.
Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.