tokenizer

Crates.io: tokenizer
lib.rs: tokenizer
version: 0.1.2
source: src
created_at: 2020-05-30 08:34:05.64926
updated_at: 2020-05-31 08:38:32.37509
description: Thai text tokenizer
homepage:
repository: https://github.com/NattapongSiri/tokenizer_rs/tree/0.1.1
max_upload_size:
id: 247636
size: 65,309
owner: NattapongSiri

README

tokenizer_rs

A word tokenizer written purely in Rust. It currently has two tokenizers:

  1. en - A whitespace-based tokenizer where words are split on whitespace
  2. th - A dictionary-based tokenizer using the "maximum matching" algorithm, with basic unknown-word handling that minimizes the number of unknown characters until some known word(s) are found (see the sketch after this list)
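
Below is a minimal sketch of the greedy "maximum matching" idea. It is illustrative only, not the crate's actual implementation: the dict set and max_match function are hypothetical names, and unknown text is handled here by emitting single characters, which is simpler than the unknown-word handling described above.

use std::collections::HashSet;

// Greedy maximum matching: at each position take the longest dictionary word
// that starts there; if none matches, emit one character as unknown and move on.
fn max_match<'a>(text: &'a str, dict: &HashSet<&str>) -> Vec<&'a str> {
    let mut tokens = Vec::new();
    let mut start = 0;
    while start < text.len() {
        // Track the end of the longest dictionary match starting at `start`.
        let mut best_end = None;
        for (offset, ch) in text[start..].char_indices() {
            let end = start + offset + ch.len_utf8();
            if dict.contains(&text[start..end]) {
                best_end = Some(end);
            }
        }
        match best_end {
            Some(end) => {
                tokens.push(&text[start..end]);
                start = end;
            }
            None => {
                // Unknown character: emit it on its own and advance by one char.
                let ch_len = text[start..].chars().next().unwrap().len_utf8();
                tokens.push(&text[start..start + ch_len]);
                start += ch_len;
            }
        }
    }
    tokens
}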

It currently supports two feature gates:

  • multi-thread - It will attempt to use multiple threads for tokenization.
  • single-thread - It will use a single thread.

As it currently stands, the Thai word tokenizer supports both features. It uses Rayon for multi-threaded tokenization: the text is first split on whitespace, then each chunk is tokenized on a separate thread using Rayon's parallel iterator, as sketched below.
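
The following is a rough illustration of that chunk-and-parallelize pattern with Rayon, not the crate's real internals; tokenize_chunk is a hypothetical stand-in for the per-chunk dictionary tokenizer.

use rayon::prelude::*;

// Placeholder per-chunk tokenizer; a real implementation would run
// dictionary-based maximum matching on each whitespace-delimited chunk here.
fn tokenize_chunk(chunk: &str) -> Vec<String> {
    vec![chunk.to_string()]
}

fn tokenize_parallel(text: &str) -> Vec<String> {
    let chunks: Vec<&str> = text.split_whitespace().collect();
    chunks
        .par_iter()                              // one Rayon task per chunk
        .flat_map(|chunk| tokenize_chunk(chunk)) // tokenize chunks in parallel
        .collect()                               // gather all tokens
}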

The English tokenizer doesn't actually leverage multi-threading yet, but it works with both features.

By default, it will use multi-threading.

How to use

Put the following lines in the dependencies section of your Cargo.toml. For example:

[dependencies]
tokenizer = "^0.1"

This will attempt to use multiple threads for tokenization.

To force single-threaded operation, use the single-thread feature:

[dependencies]
tokenizer = { version = "^0.1", features = ["single-thread"] }

An example of Thai text tokenization:

use tokenizer::{Tokenizer, th};
let tokenizer = th::Tokenizer::new("path/to/dictionary.txt").expect("Dictionary file not found");
// Assuming the dictionary contains "ภาษาไทย" and "นิดเดียว" but not "ง่าย"
assert_eq!(tokenizer.tokenize("ภาษาไทยง่ายนิดเดียว"), vec!["ภาษาไทย", "ง่าย", "นิดเดียว"]);

Sample implementation using the Lexitron dictionary

I have created a code sample that calculates the F1 score over 10 Monte Carlo simulation tests, where each test uses a sample size of 200 and holds 10% of that sample out of the tokenizer's dictionary, to measure the tokenizer's quality when 10% of the words in the text are unknown.
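
As a rough illustration of how such an F1 score can be computed (not necessarily how that repository does it), the sketch below scores predicted tokens against reference tokens by comparing their character spans; spans and f1_score are hypothetical helpers and assume both tokenizations concatenate to the same text.

use std::collections::HashSet;

// Convert a token sequence into (start, end) byte spans over the joined text.
fn spans(tokens: &[&str]) -> HashSet<(usize, usize)> {
    let mut out = HashSet::new();
    let mut pos = 0;
    for t in tokens {
        out.insert((pos, pos + t.len()));
        pos += t.len();
    }
    out
}

// F1 over exactly matching token spans: 2 * precision * recall / (precision + recall).
fn f1_score(predicted: &[&str], reference: &[&str]) -> f64 {
    let p_spans = spans(predicted);
    let r_spans = spans(reference);
    let correct = p_spans.intersection(&r_spans).count() as f64;
    if correct == 0.0 {
        return 0.0;
    }
    let precision = correct / p_spans.len() as f64;
    let recall = correct / r_spans.len() as f64;
    2.0 * precision * recall / (precision + recall)
}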

That repository uses the Lexitron dictionary from NECTEC. Before you use it, you should read their license agreement first.

I have also created a Monte Carlo simulation which takes the entire dictionary, shuffles it, uses 90% of it to build the tokenizer, then reshuffles it and uses all of it to calculate the F1 score. The repository can be found here.
