litsea

Crates.iolitsea
lib.rslitsea
version0.2.0
created_at2025-05-31 06:24:52.504815+00
updated_at2025-06-04 13:35:44.511727+00
descriptionLitsea is an extreamely compact word segmentation and model training tool implemented in Rust.
homepagehttps://github.com/mosuka/litsea
repositoryhttps://github.com/mosuka/litsea
max_upload_size
id1696092
size142,936
Minoru OSUKA (mosuka)

documentation

https://docs.rs/litsea

README

Litsea

Litsea is an extremely compact word segmentation software implemented in Rust, inspired by TinySegmenter and TinySegmenterMaker. Unlike traditional morphological analyzers such as MeCab and Lindera, Litsea does not rely on large-scale dictionaries but instead performs segmentation using a compact pre-trained model. It features a fast and safe Rust implementation along with a learner designed to be simple and highly extensible.

There is a small plant called Litsea cubeba (Aomoji) in the same camphoraceae family as Lindera (Kuromoji). This is the origin of the name Litsea.

How to build Litsea

Litsea is implemented in Rust. To build it, follow these steps:

Prerequisites

  • Install Rust (stable channel) from rust-lang.org.
  • Ensure Cargo (Rust’s package manager) is available.

Build Instructions

  1. Clone the Repository

    If you haven't already cloned the repository, run:

    git clone https://github.com/mosuka/litsea.git
    cd litsea
    
  2. Obtain Dependencies and Build

    In the project's root directory, run:

    cargo build --release
    

    The --release flag produces an optimized build.

  3. Verify the Build

    Once complete, the executable will be in the target/release folder. Verify by running:

    ./target/release/litsea --help
    

Additional Notes

  • Using the latest stable Rust ensures compatibility with dependencies and allows use of modern features.
  • Run cargo update to refresh your dependencies if needed.

How to train models

Prepare a corpus with words separated by spaces in advance.

  • corpus.txt

    Litsea は TinySegmenter を 参考 に 開発 さ れ た 、 Rust で 実装 さ れ た 極めて コンパクト な 単語 分割 ソフトウェア です 。
    
    

Extract the information and features from the corpus:

./target/release/litsea extract ./resources/corpus.txt ./resources/features.txt

The output from the extract command is similar to:

Feature extraction completed successfully.

Train the features output by the above command using AdaBoost. Training stops if the new weak classifier’s accuracy falls below 0.001 or after 10,000 iterations.

./target/release/litsea train -t 0.001 -i 10000 ./resources/features.txt ./resources/model

The output from the train command is similar to:

finding instances...: 61 instances found
loading instances...: 61/61 instances loaded
Iteration 9999 - margin: 0.16068839956263622
Result Metrics:
  Accuracy: 100.00% ( 61 / 61 )
  Precision: 100.00% ( 24 / 24 )
  Recall: 100.00% ( 24 / 24 )
  Confusion Matrix:
    True Positives: 24
    False Positives: 0
    False Negatives: 0
    True Negatives: 37

How to segment sentences into words

Use the trained model to segment sentences:

echo "LitseaはTinySegmenterを参考に開発された、Rustで実装された極めてコンパクトな単語分割ソフトウェアです。" | ./target/release/litsea segment ./resources/model

The output will look like:

Litsea は TinySegmenter を 参考 に 開発 さ れ た 、 Rust で 実装 さ れ た 極めて コンパクト な 単語 分割 ソフトウェア です 。

Pre-trained models

  • JEITA_Genpaku_ChaSen_IPAdic.model
    This model is trained using the morphologically analyzed corpus published by the Japan Electronics and Information Technology Industries Association (JEITA). It employs data from [Project Sugita Genpaku] analyzed with ChaSen+IPAdic.

  • RWCP.model
    Extracted from the original TinySegmenter, this model contains only the segmentation component.

How to retrain existing models

You can further improve performance by resuming training from an existing model with new corpora:

./target/release/litsea train -t 0.001 -i 10000 -m ./resources/model ./resources/new_features.txt ./resources/new_model
Commit count: 44

cargo fmt