created_at2017-12-27 15:16:42.013756
updated_at2024-05-12 18:30:26.520734
descriptionWord segmentation/breaking library
Vee Satayamas (veer66)



Moved to https://codeberg.org/mekong-lang/wordcut-engine


Word segmentation library in Rust


use wordcut_engine::load_dict;
use wordcut_engine::Wordcut;
use std::path::Path;

fn main() {
    let dict_path = Path::new(concat!(
    let dict = load_dict(dict_path).unwrap();
    let wordcut = Wordcut::new(dict);
    println!("{}", wordcut.put_delimiters("หมากินไก่", "|"));

For Evcxr

Load this project as a dependency

:dep .

Import symbols

use wordcut_engine::load_dict;
use wordcut_engine::Wordcut;
use wordcut_engine::Dict;
use std::path::Path;


let dict: Dict = load_dict("data/thai.txt").unwrap();
let wordcut = Wordcut::new(dict);


let txt = "หมากินไก่";
wordcut.put_delimiters(txt, "|")
wordcut.build_path(txt, &txt.chars().collect::<Vec<_>>())
dbg!(wordcut.build_path(txt, &txt.chars().collect::<Vec<_>>()));


wordcut-engine has three steps:

  1. Identifying clusters, which are substrings that must not be split
  2. Identifying edges of split directed acyclic graph (split-DAG); The program does not add edges that break any cluster to the graph.
  3. Tokenizing a string by finding the shortest path in the split-DAG

Identifying clusters

Identifying clusters identify which substrings that must not be split.

  1. Wrapping regular expressions with parentheses

For example,


The above rules are wrapped with parentheses as shown below:

  1. Joining regular expressions with vertical bars (|)

for example,

  1. Building a DFA from the joined regular expression using regex-automata

  2. Creating a directed acyclic graph (DAG) by adding edges using the DFA

  3. Identifying clusters following a shortest path of a DAG from step above

Note: wordcut-engine does not allow a context sensitive rule, since it hurts the performance too much. Moreover, instead of longest matching, we use a DAG, and its shortest path to contraint cluster boundary by another cluster, therefore newmm-style context sensitive rules are not required.

Identifying split-DAG edges

In contrary to identifying clusters, identifying split-DAG edges identify what must be split. Split-DAG edge makers, wordcut-engine has three types of split-DAG edge maker, that are:

  1. Dictionary-based maker
  2. Rule-based maker
  3. Default maker (Unk edge builder)

The dictionary-based maker traverses a prefix tree, which is particularly a trie in wordcut-engine and create an edge that matched word in the prefix tree. Rule-based maker uses regex-automata's Regex matcher built from split rules to find longest matched substrings, and add corresponding edges to the graph. wordcut-engine removes edges that break clusters. The example of split rules are shown below:

[\r\t\n ]+

If there is no edge for each of character indice yet, a default maker create a edge that connected the known rightmost boundary.

Commit count: 0

cargo fmt