litsea

Crates.io	litsea
lib.rs	litsea
version	0.3.2
created_at	2025-05-31 06:24:52.504815+00
updated_at	2025-10-16 13:44:59.309718+00
description	Litsea is an extremely compact word segmentation and model training tool implemented in Rust.
homepage	https://github.com/mosuka/litsea/litsea
repository	https://github.com/mosuka/litsea
id	1696092
size	103,881

Minoru OSUKA (mosuka)

documentation

https://docs.rs/litsea

README

Litsea

Litsea is extremely compact word segmentation software implemented in Rust, inspired by TinySegmenter and TinySegmenterMaker. Unlike traditional morphological analyzers such as MeCab and Lindera, Litsea does not rely on large-scale dictionaries; instead, it performs segmentation using a compact pre-trained model. It features a fast and safe Rust implementation along with a learner designed to be simple and highly extensible.

Litsea takes its name from Litsea cubeba (Aomoji), a small plant in the laurel family (Lauraceae), the same family as Lindera (Kuromoji).

How to build Litsea

Litsea is implemented in Rust. To build it, follow these steps:

Prerequisites

Install Rust (stable channel) from rust-lang.org.
Ensure Cargo (Rust’s package manager) is available.

Build Instructions

Clone the Repository

If you haven't already cloned the repository, run:
```
git clone https://github.com/mosuka/litsea.git
cd litsea
```
Obtain Dependencies and Build

In the project's root directory, run:
```
cargo build --release
```
The --release flag produces an optimized build.
Verify the Build

Once complete, the executable will be in the target/release folder. Verify by running:
```
./target/release/litsea --help
```

Additional Notes

Using the latest stable Rust ensures compatibility with dependencies and allows use of modern features.
Run cargo update to refresh your dependencies if needed.

How to train models

Prepare a corpus with words separated by spaces in advance.

corpus.txt

Litsea は TinySegmenter を 参考 に 開発 さ れ た 、 Rust で 実装 さ れ た 極めて コンパクト な 単語 分割 ソフトウェア です 。
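To illustrate what a space-separated corpus encodes, the sketch below turns one corpus line into the unsegmented text plus one boundary label per position between adjacent characters. This is only an illustration of the general idea, not Litsea's extractor; the function name `boundary_labels` is hypothetical. For the example line it yields 61 positions with 24 boundaries, which is consistent with the instance counts in the training log shown later in this README.

```rust
/// Splits a space-separated corpus line into the raw text and, for each
/// position between two adjacent characters, whether a word boundary falls there.
/// (Hypothetical helper for illustration; not part of Litsea's API.)
fn boundary_labels(line: &str) -> (String, Vec<bool>) {
    let mut text = String::new();
    let mut labels = Vec::new();
    for word in line.split_whitespace() {
        for (i, ch) in word.chars().enumerate() {
            if !text.is_empty() {
                // A boundary precedes the first character of every word
                // except the very first word of the line.
                labels.push(i == 0);
            }
            text.push(ch);
        }
    }
    (text, labels)
}

fn main() {
    let line = "Litsea は TinySegmenter を 参考 に 開発 さ れ た 、 Rust で 実装 さ れ た 極めて コンパクト な 単語 分割 ソフトウェア です 。";
    let (text, labels) = boundary_labels(line);
    let boundaries = labels.iter().filter(|&&b| b).count();
    println!("text: {text}");
    println!("positions: {}, boundaries: {}", labels.len(), boundaries);
}
```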

Extract the information and features from the corpus:

./target/release/litsea extract ./resources/corpus.txt ./resources/features.txt

The output from the extract command is similar to:

Feature extraction completed successfully.

Train a model with AdaBoost on the features produced by the extract command. With the options below, training stops when the new weak classifier’s accuracy falls below 0.001 (-t) or after 10,000 iterations (-i).

./target/release/litsea train -t 0.001 -i 10000 ./resources/features.txt ./resources/model
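For readers unfamiliar with AdaBoost, the following is a minimal sketch of a discrete AdaBoost loop over binary features with the two stopping conditions described above. It is a generic textbook formulation, not Litsea's implementation. All names (`Instance`, `train`, `predict`) are hypothetical, and it assumes the threshold is compared against how far the best weak classifier's weighted error is from chance (0.5).

```rust
/// One training instance: the binary features that fire (no duplicates),
/// and the label (+1.0 for a boundary, -1.0 otherwise).
struct Instance {
    features: Vec<usize>,
    label: f64,
}

/// Discrete AdaBoost where weak classifier k predicts +1 if feature k is
/// present and -1 otherwise. Returns the accumulated weight per feature.
fn train(data: &[Instance], num_features: usize, threshold: f64, max_iter: usize) -> Vec<f64> {
    let n = data.len() as f64;
    let mut d = vec![1.0 / n; data.len()]; // instance weights
    let mut model = vec![0.0; num_features];
    for _ in 0..max_iter {
        // Error of "always predict -1" is the total weight of positive instances.
        let pos_weight: f64 = data.iter().zip(&d)
            .filter(|(x, _)| x.label > 0.0)
            .map(|(_, w)| *w)
            .sum();
        let mut errors = vec![pos_weight; num_features];
        for (x, w) in data.iter().zip(&d) {
            for &k in &x.features {
                // Classifier k predicts +1 here: right if label is +1, wrong if -1.
                errors[k] -= x.label * w;
            }
        }
        // Pick the classifier whose weighted error is farthest from chance.
        let (best, err) = errors.iter().copied().enumerate()
            .max_by(|a, b| (0.5 - a.1).abs().total_cmp(&(0.5 - b.1).abs()))
            .expect("at least one feature is required");
        if (0.5 - err).abs() < threshold {
            break; // assumed stopping rule: no classifier is usefully better than chance
        }
        let err = err.clamp(1e-10, 1.0 - 1e-10); // avoid ln(0) and division by zero
        let alpha = 0.5 * ((1.0 - err) / err).ln();
        model[best] += alpha;
        // Misclassified instances gain weight; then renormalize to sum to 1.
        for (x, w) in data.iter().zip(d.iter_mut()) {
            let h = if x.features.contains(&best) { 1.0 } else { -1.0 };
            *w *= (-alpha * x.label * h).exp();
        }
        let z: f64 = d.iter().sum();
        d.iter_mut().for_each(|w| *w /= z);
    }
    model
}

/// Weighted vote of all weak classifiers; a positive score means "boundary".
fn predict(model: &[f64], features: &[usize]) -> f64 {
    let total: f64 = model.iter().sum();
    let present: f64 = features.iter().map(|&k| model[k]).sum();
    2.0 * present - total
}

fn main() {
    let data = vec![
        Instance { features: vec![0, 1], label: 1.0 },
        Instance { features: vec![0], label: 1.0 },
        Instance { features: vec![1], label: -1.0 },
        Instance { features: vec![2], label: -1.0 },
    ];
    let model = train(&data, 3, 0.001, 10);
    for x in &data {
        println!("label {:+}, score {:+.2}", x.label, predict(&model, &x.features));
    }
}
```

In this toy data set feature 0 separates the classes perfectly, so every round selects it and the loop runs until the iteration limit. If no feature is informative, the loop exits on the threshold instead.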

The output from the train command is similar to:

finding instances...: 61 instances found
loading instances...: 61/61 instances loaded
Iteration 9999 - margin: 0.16068839956263622
Result Metrics:
  Accuracy: 100.00% ( 61 / 61 )
  Precision: 100.00% ( 24 / 24 )
  Recall: 100.00% ( 24 / 24 )
  Confusion Matrix:
    True Positives: 24
    False Positives: 0
    False Negatives: 0
    True Negatives: 37
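
The metrics in this report follow from the confusion matrix by the standard definitions. As a worked example, the sketch below recomputes them from the four counts above. The `Confusion` type is hypothetical and only mirrors the printed fields; it is not taken from Litsea's code.

```rust
/// Counts from a binary confusion matrix (hypothetical type for illustration).
struct Confusion {
    true_pos: u32,
    false_pos: u32,
    false_neg: u32,
    true_neg: u32,
}

impl Confusion {
    /// Fraction of all instances classified correctly.
    fn accuracy(&self) -> f64 {
        let total = self.true_pos + self.false_pos + self.false_neg + self.true_neg;
        f64::from(self.true_pos + self.true_neg) / f64::from(total)
    }

    /// Fraction of predicted boundaries that are real boundaries.
    fn precision(&self) -> f64 {
        f64::from(self.true_pos) / f64::from(self.true_pos + self.false_pos)
    }

    /// Fraction of real boundaries that were predicted.
    fn recall(&self) -> f64 {
        f64::from(self.true_pos) / f64::from(self.true_pos + self.false_neg)
    }
}

fn main() {
    // Counts copied from the training output above.
    let m = Confusion { true_pos: 24, false_pos: 0, false_neg: 0, true_neg: 37 };
    println!("Accuracy: {:.2}%", m.accuracy() * 100.0);
    println!("Precision: {:.2}%", m.precision() * 100.0);
    println!("Recall: {:.2}%", m.recall() * 100.0);
}
```

Note that a denominator of zero (for example, no predicted positives) yields NaN here; a real tool would need to decide how to report that case.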

How to segment sentences into words

Use the trained model to segment sentences:

echo "LitseaはTinySegmenterを参考に開発された、Rustで実装された極めてコンパクトな単語分割ソフトウェアです。" | ./target/release/litsea segment ./resources/model

The output will look like:

Litsea は TinySegmenter を 参考 に 開発 さ れ た 、 Rust で 実装 さ れ た 極めて コンパクト な 単語 分割 ソフトウェア です 。
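
When consuming this output from another program, a simple sanity check is that removing the separators gives back the input. The sketch below shows one way to do that. It assumes the input itself contains no whitespace, and the helper names `tokens` and `is_consistent` are hypothetical.

```rust
/// Splits space-separated segmenter output into its tokens.
fn tokens(segmented: &str) -> Vec<&str> {
    segmented.split_whitespace().collect()
}

/// Returns true if the tokens, concatenated, reproduce the original input.
/// Assumes `original` contains no whitespace of its own.
fn is_consistent(original: &str, segmented: &str) -> bool {
    tokens(segmented).concat() == original
}

fn main() {
    let original = "LitseaはTinySegmenterを参考に開発された、Rustで実装された極めてコンパクトな単語分割ソフトウェアです。";
    let segmented = "Litsea は TinySegmenter を 参考 に 開発 さ れ た 、 Rust で 実装 さ れ た 極めて コンパクト な 単語 分割 ソフトウェア です 。";
    println!("tokens: {}", tokens(segmented).len());
    println!("consistent: {}", is_consistent(original, segmented));
}
```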

Pre-trained models

JEITA_Genpaku_ChaSen_IPAdic.model
This model was trained on the morphologically analyzed corpus published by the Japan Electronics and Information Technology Industries Association (JEITA). It uses data from Project Sugita Genpaku analyzed with ChaSen+IPAdic.
RWCP.model
Extracted from the original TinySegmenter, this model contains only the segmentation component.

How to retrain existing models

You can further improve performance by resuming training from an existing model (passed with -m) on features extracted from a new corpus:

./target/release/litsea train -t 0.001 -i 10000 -m ./resources/model ./resources/new_features.txt ./resources/new_model

License

This project is distributed under the MIT License.
It also contains code originally developed by Taku Kudo and released under the BSD 3-Clause License.
See the LICENSE file for details.

Commit count: 44