| Crates.io | opencc-fmmseg |
| lib.rs | opencc-fmmseg |
| version | 0.8.4 |
| created_at | 2025-07-10 07:45:27.833722+00 |
| updated_at | 2026-01-04 02:36:34.073259+00 |
| description | High-performance OpenCC-based Chinese conversion using FMM (Forward Maximum Matching) segmentation. |
| homepage | https://github.com/laisuk/opencc-fmmseg |
| repository | https://github.com/laisuk/opencc-fmmseg |
| max_upload_size | |
| id | 1746008 |
| size | 3,293,467 |
opencc-fmmseg is a high-performance Rust-based engine for Chinese text conversion.
It combines OpenCC's lexicons with an
optimized Forward Maximum Matching (FMM) algorithm, suitable for:
use opencc_fmmseg::OpenCC;
fn main() {
let input = "汉字转换测试";
let opencc = OpenCC::new();
let output = opencc.convert(input, "s2t", false);
println!("{}", output); // 漢字轉換測試
}
Grab the latest version for your platform from the Releases page:
| Platform | Download Link |
|---|---|
| 🪟 Windows | opencc-fmmseg-{latest}-windows-x64.zip |
| 🐧 Linux | opencc-fmmseg-{latest}-linux-x64.tar.gz |
| 🍎 macOS | opencc-fmmseg-{latest}-macos-arm64.tar.gz |
Each archive contains:
README.txt
version.txt
bin/ # Command-line tools
lib/ # Shared library (.dll / .so / .dylib)
include/ # C API header + C++ helper header
key_length_mask,
starter_len_mask), and zero-copy string views for near-native throughput.git clone https://github.com/laisuk/opencc-fmmseg
cd opencc-fmmseg
cargo build --release --workspace
The CLI tool will be located at:
target/release/
opencc-rs # CLI plain text and Office document text converter
opencc-clip # Convert from clipboard, auto detect config
dict-generate # Generate dictionary ZSTD, CBOR or JSON files
opencc-rs convertConvert plain text using OpenCC
Usage: opencc-rs.exe convert [OPTIONS] --config <config>
Options:
-i, --input <file> Input file (use stdin if omitted for non-office documents)
-o, --output <file> Output file (use stdout if omitted for non-office documents)
-c, --config <config> Conversion configuration [possible values: s2t, t2s, s2tw, tw2s, s2twp, tw2sp, s2hk, hk2s, t2tw, t2twp, t2hk, tw2t, tw2tp, hk2t, t2jp, jp2t]
-p, --punct Enable punctuation conversion
--in-enc <in_enc> Encoding for input [default: UTF-8]
--out-enc <out_enc> Encoding for output [default: UTF-8]
-h, --help Print help
opencc-rs officeConvert Office or EPUB documents using OpenCC
Usage: opencc-rs.exe office [OPTIONS] --config <config>
Options:
-i, --input <file> Input file (use stdin if omitted for non-office documents)
-o, --output <file> Output file (use stdout if omitted for non-office documents)
-c, --config <config> Conversion configuration [possible values: s2t, t2s, s2tw, tw2s, s2twp, tw2sp, s2hk, hk2s, t2tw, t2twp, t2hk, tw2t, tw2tp, hk2t, t2jp, jp2t]
-p, --punct Enable punctuation conversion
-f, --format <ext> Force document format: docx, odt, epub...
--keep-font Preserve original font styles
--auto-ext Infer format from file extension
-h, --help Print help
./opencc-rs convert -c s2t -i text_simplified.txt -o text_traditional.txt
.docx, .xlsx, .pptx, .odt, .ods, .odp, .epub./opencc-rs office -c s2t --punct --format docx -i doc_simplified.docx -o doc_traditional.docx
s2t – Simplified to Traditionals2tw – Simplified to Traditional Taiwans2twp – Simplified to Traditional Taiwan with idiomst2s – Traditional to Simplifiedtw2s – Traditional Taiwan to Simplifiedtw2sp – Traditional Taiwan to Simplified with idiomsBy default, it uses OpenCC's built-in lexicon paths. You can also provide your own lexicon dictionary generated by
dict-generate CLI tool.
You can also use opencc-fmmseg as a library:
To use opencc-fmmseg in your project, add this to your Cargo.toml:
[dependencies]
opencc-fmmseg = "0.8.4" # or latest version
Then use it in your code:
use opencc_fmmseg::{OpenCC};
use opencc_fmmseg::OpenccConfig;
fn main() {
// ---------------------------------------------------------------------
// Sample UTF-8 input (same spirit as C / C++ demos)
// ---------------------------------------------------------------------
let input_text = "意大利邻国法兰西罗浮宫里收藏的“蒙娜丽莎的微笑”画像是旷世之作。";
println!("Text:");
println!("{}", input_text);
println!();
// ---------------------------------------------------------------------
// Create OpenCC instance
// ---------------------------------------------------------------------
let converter = OpenCC::new();
// Detect script
let input_code = converter.zho_check(input_text);
println!("Text Code: {}", input_code);
// ---------------------------------------------------------------------
// Test 1: Legacy string-based config (convert)
// ---------------------------------------------------------------------
let config_str = "s2twp";
let punct = true;
println!();
println!(
"== Test 1: convert(config = \"{}\", punctuation = {}) ==",
config_str, punct
);
let output1 = converter.convert(input_text, config_str, punct);
println!("Converted:");
println!("{}", output1);
println!("Converted Code: {}", converter.zho_check(&output1));
println!(
"Last Error: {}",
OpenCC::get_last_error().unwrap_or_else(|| "<none>".to_string())
);
// ---------------------------------------------------------------------
// Test 2: Strongly typed config (convert_with_config)
// ---------------------------------------------------------------------
let config_enum = OpenccConfig::S2twp;
println!();
println!(
"== Test 2: convert_with_config(config = {:?}, punctuation = {}) ==",
config_enum, punct
);
let output2 = converter.convert_with_config(input_text, config_enum, punct);
println!("Converted:");
println!("{}", output2);
println!("Converted Code: {}", converter.zho_check(&output2));
println!(
"Last Error: {}",
OpenCC::get_last_error().unwrap_or_else(|| "<none>".to_string())
);
// ---------------------------------------------------------------------
// Test 3: Invalid config (string) — self-protected
// ---------------------------------------------------------------------
let invalid_config = "what_is_this";
println!();
println!(
"== Test 3: invalid string config (\"{}\") ==",
invalid_config
);
let output3 = converter.convert(input_text, invalid_config, true);
println!("Returned:");
println!("{}", output3);
println!(
"Last Error: {}",
OpenCC::get_last_error().unwrap_or_else(|| "<none>".to_string())
);
// ---------------------------------------------------------------------
// Test 4: Clear last error and verify state reset
// ---------------------------------------------------------------------
println!();
println!("== Test 4: clear_last_error() ==");
OpenCC::clear_last_error();
println!(
"Last Error after clear: {}",
OpenCC::get_last_error().unwrap_or_else(|| "<none>".to_string())
);
// ---------------------------------------------------------------------
// Summary
// ---------------------------------------------------------------------
println!();
println!("All tests completed.");
}
Output:
Text:
意大利邻国法兰西罗浮宫里收藏的“蒙娜丽莎的微笑”画像是旷世之作。
Text Code: 2
== Test 1: convert(config = "s2twp", punctuation = true) ==
Converted:
義大利鄰國法蘭西羅浮宮裡收藏的「蒙娜麗莎的微笑」畫像是曠世之作。
Converted Code: 1
Last Error: <none>
== Test 2: convert_with_config(config = S2twp, punctuation = true) ==
Converted:
義大利鄰國法蘭西羅浮宮裡收藏的「蒙娜麗莎的微笑」畫像是曠世之作。
Converted Code: 1
Last Error: <none>
== Test 3: invalid string config ("what_is_this") ==
Returned:
Invalid config: what_is_this
Last Error: Invalid config: what_is_this
== Test 4: clear_last_error() ==
Last Error after clear: <none>
📦 Crate: opencc-fmmseg on crates.io
📄 Docs: docs.rs/opencc-fmmseg
opencc_fmmseg_capi)You can also use opencc-fmmseg via a C API for integration with C/C++ projects.
The zip includes:
opencc_fmmseg_capi.{so,dylib,dll}opencc_fmmseg_capi.hOpenccFmmsegHelper.hppYou can link against the shared library and call the segmentation/convert functions from any C or C++ project.
#include <stdio.h>
#include "opencc_fmmseg_capi.h"
int main(void) {
void *handle = opencc_new();
const char *config = "s2t";
const char *input = u8"汉字";
char *result = opencc_convert(handle, input, config, false);
printf("Input : %s\n", input);
printf("Converted: %s\n", result);
opencc_string_free(result);
opencc_delete(handle);
return 0;
}
##include <stdio.h>
#include <stdbool.h>
#include "opencc_fmmseg_capi.h"
int main(int argc, char **argv) {
void *opencc = opencc_new();
bool is_parallel = opencc_get_parallel(opencc);
printf("OpenCC is_parallel: %d\n", is_parallel);
const char *config = u8"s2twp";
const char *text = u8"意大利邻国法兰西罗浮宫里收藏的“蒙娜丽莎的微笑”画像是旷世之作。";
printf("Text: %s\n", text);
int code = opencc_zho_check(opencc, text);
printf("Text Code: %d\n", code);
char *result = opencc_convert(opencc, text, config, true);
code = opencc_zho_check(opencc, result);
char *last_error = opencc_last_error();
printf("Converted: %s\n", result);
printf("Text Code: %d\n", code);
printf("Last Error: %s\n", last_error == NULL ? "No error" : last_error);
if (last_error != NULL) opencc_error_free(last_error);
if (result != NULL) opencc_string_free(result);
opencc_delete(opencc);
return 0;
}
OpenCC is_parallel: 1
Text: 意大利邻国法兰西罗浮宫里收藏的“蒙娜丽莎的微笑”画像是旷世之作。
Text Code: 2
Converted: 義大利鄰國法蘭西羅浮宮裡收藏的「蒙娜麗莎的微笑」畫像是曠世之作。
Text Code: 1
Last Error: No error
opencc_new() creates and initializes a new OpenCC-FMMSEG instance.
opencc_convert(...) is the legacy string-based API:
"s2t", "t2s", "s2twp"."Invalid config: ...") is returned.opencc_convert_cfg(...) is the recommended API for new code:
opencc_config_t) instead of strings.opencc_convert_cfg_mem(...) is an advanced buffer-based API:
'\0') is reported via out_required.out_buf = NULL and out_cap = 0.char*-returning APIs.All input and output strings use null-terminated UTF-8 encoding.
punctuation accepts standard C Boolean values (true / false)
via <stdbool.h>.
opencc_string_free(...) must be used to free strings returned by:
opencc_convert(...)opencc_convert_cfg(...)opencc_last_error(...)opencc_error_free(...) frees memory returned by opencc_last_error() only.
It does not clear the internal error state.
opencc_clear_last_error() clears the internal error state:
opencc_last_error() will return "No error".opencc_error_free().opencc_last_error() returns the most recent error message:
"No error" if no error is recorded.opencc_error_free().opencc_delete(...) destroys the OpenCC instance and frees its resources.
opencc_zho_check(...) detects the script of the input text:
1 = Traditional Chinese2 = Simplified Chinese0 = Other / UndeterminedParallel mode can be queried using opencc_get_parallel() and modified
using opencc_set_parallel(...).
src/lib.rs – Main library with segmentation logic.capi/opencc-fmmseg-capi C API source and demo.tools/opencc-rs/src/main.rs – CLI tool (opencc-rs) implementation.dicts/ – OpenCC text lexicons which converted into Zstd compressed CBOR format.opencc-fmmseg Conversion SpeedTested using Criterion.rs on 1.2 million characters with
punctuation disabled (punctuation = false), built in release mode with Rayon enabled.
Results from v0.8.3:
| Input Size | s2t Mean Time | t2s Mean Time |
|---|---|---|
| 100 | 3.62 µs | 2.05 µs |
| 1,000 | 38.06 µs | 33.50 µs |
| 10,000 | 202.66 µs | 130.58 µs |
| 100,000 | 1.096 ms | 0.686 ms |
| 1,000,000 | 12.822 ms | 9.089 ms |
📊 Throughput Interpretation
At this level, CPU saturation is negligible — I/O or interop overhead (file/clipboard/network) now dominates
runtime.
The new mask-first gating (key_length_mask + starter_len_mask) delivers perfect O(n) scaling and
ultra-stable parallel throughput across large text corpora.
