| Crates.io | cang-jie |
| lib.rs | cang-jie |
| version | 0.18.0 |
| created_at | 2018-09-18 14:35:08.671671+00 |
| updated_at | 2023-11-04 12:49:28.873055+00 |
| description | A Chinese tokenizer for tantivy |
| homepage | |
| repository | https://github.com/DCjanus/cang-jie |
| max_upload_size | |
| id | 85364 |
| size | 13,506 |
A Chinese tokenizer for tantivy, based on jieba-rs.
Currently, only UTF-8 input is supported.
let mut schema_builder = SchemaBuilder::default();
let text_indexing = TextFieldIndexing::default()
    .set_tokenizer(CANG_JIE) // Set custom tokenizer
    .set_index_option(IndexRecordOption::WithFreqsAndPositions);
let text_options = TextOptions::default()
    .set_indexing_options(text_indexing)
    .set_stored();
// ... Some code
let index = Index::create(RAMDirectory::create(), schema.clone())?;
let tokenizer = CangJieTokenizer {
    worker: Arc::new(Jieba::empty()), // empty dictionary
    option: TokenizerOption::Unicode,
};
index.tokenizers().register(CANG_JIE, tokenizer);
// ... Some code
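
Because the tokenizer only handles UTF-8, text that arrives as raw bytes (for example from a file in another encoding) may need to be checked before it is indexed. The following is a minimal sketch using only the Rust standard library; `to_utf8_text` and `require_utf8` are hypothetical helper names for illustration and are not part of cang-jie or tantivy.

```rust
use std::borrow::Cow;
use std::str::Utf8Error;

/// Hypothetical helper (not part of cang-jie): lenient conversion.
/// Borrows the input when it is already valid UTF-8; otherwise allocates a
/// copy in which invalid sequences are replaced with U+FFFD.
fn to_utf8_text(raw: &[u8]) -> Cow<'_, str> {
    String::from_utf8_lossy(raw)
}

/// Hypothetical helper (not part of cang-jie): strict conversion.
/// Returns an error instead of repairing invalid input.
fn require_utf8(raw: &[u8]) -> Result<&str, Utf8Error> {
    std::str::from_utf8(raw)
}

fn main() {
    let valid = "仓颉分词".as_bytes();
    // 0xFF and 0xFE can never appear in well-formed UTF-8.
    let invalid: &[u8] = &[0xFF, 0xFE];

    println!("strict, valid input:   {:?}", require_utf8(valid));
    println!("strict, invalid input: {:?}", require_utf8(invalid));
    println!("lenient, invalid input: {}", to_utf8_text(invalid));
}
```

The strict variant suits pipelines that should skip or report bad documents. The lenient variant keeps indexing going at the cost of replacement characters in the stored text. Converting from a legacy encoding such as GBK would need an additional crate and is not shown here.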