Crates.io | kl-hyphenate |
lib.rs | kl-hyphenate |
version | 0.7.3 |
source | src |
created_at | 2019-08-09 10:09:44.841589 |
updated_at | 2020-05-20 09:56:31.810791 |
description | Knuth-Liang hyphenation for a variety of languages |
homepage | https://github.com/baskerville/kl-hyphenate |
repository | https://github.com/baskerville/kl-hyphenate |
max_upload_size | |
id | 155334 |
size | 71,229 |
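Two hyphenation strategies are available: standard Knuth-Liang hyphenation, provided by the Standard type used below, and extended hyphenation, for which extended patterns are bundled for Catalan and Hungarian.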
The dictionaries can be built with:
cargo build -vv --features build_dictionaries
The resulting dictionaries are saved in the dictionaries directory.
You can then load and use a dictionary with:
use kl_hyphenate::{Standard, Hyphenator, Language, Load};
let path_to_dict = "dictionaries/en-us.standard.bincode";
let en_us = Standard::from_path(Language::EnglishUS, path_to_dict)?;
// Identify valid breaks in the given word.
let hyphenated = en_us.hyphenate("hyphenation");
// Word breaks are represented as byte indices into the string.
let break_indices = &hyphenated.breaks;
assert_eq!(break_indices, &[2, 6, 7]);
// The segments of a hyphenated word can be iterated over.
let segments = hyphenated.into_iter().segments();
let collected: Vec<_> = segments.collect();
assert_eq!(collected, vec!["hy", "phen", "a", "tion"]);
// `hyphenate()` is case-insensitive.
let uppercase: Vec<_> = en_us.hyphenate("CAPITAL").into_iter().collect();
assert_eq!(uppercase, vec!["CAP-", "I-", "TAL"]);
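To make the byte-index representation concrete, here is a minimal sketch that splits a word at a list of break indices using only the standard library. The helper name split_at_breaks is hypothetical and not part of kl-hyphenate; it assumes the indices are ascending and fall on UTF-8 character boundaries.

```rust
/// Split `word` at the given byte indices.
/// Hypothetical helper, not part of kl-hyphenate. Assumes `breaks` is
/// ascending and that every index lies on a UTF-8 character boundary;
/// slicing panics otherwise.
fn split_at_breaks<'a>(word: &'a str, breaks: &[usize]) -> Vec<&'a str> {
    let mut segments = Vec::with_capacity(breaks.len() + 1);
    let mut start = 0;
    for &end in breaks {
        // Each segment runs from the previous break up to the current one.
        segments.push(&word[start..end]);
        start = end;
    }
    // Whatever follows the last break is the final segment.
    segments.push(&word[start..]);
    segments
}
```

Called with "hyphenation" and the indices [2, 6, 7] from the example above, this yields the segments "hy", "phen", "a" and "tion".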
Dictionaries can be used in conjunction with text segmentation to hyphenate words within a text run. This short example uses the unicode-segmentation crate for untailored Unicode segmentation.
use unicode_segmentation::UnicodeSegmentation;
let hyphenate_text = |text: &str| -> String {
    // Split the text on word boundaries,
    text.split_word_bounds()
        // then hyphenate each word individually.
        .flat_map(|word| en_us.hyphenate(word).into_iter())
        .collect()
};
let excerpt = "I know noble accents / And lucid, inescapable rhythms; […]";
assert_eq!(
    "I know no-ble ac-cents / And lu-cid, in-escapable rhythms; […]",
    hyphenate_text(excerpt)
);
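A common variation on the example above is to mark break opportunities with soft hyphens (U+00AD) rather than visible ones, leaving the final choice to the renderer. The sketch below shows the idea with the standard library only. Both soft_hyphenate and its breaks_for parameter are hypothetical: breaks_for stands in for a dictionary lookup that returns ascending byte indices, and word boundaries are approximated by runs of alphabetic characters rather than by proper Unicode segmentation.

```rust
/// Insert soft hyphens (U+00AD) into every alphabetic run of `text`.
/// Hypothetical helper, not part of kl-hyphenate. `breaks_for` stands in
/// for a dictionary lookup: given a word, it returns the ascending byte
/// indices of its permissible breaks.
fn soft_hyphenate<F>(text: &str, breaks_for: F) -> String
where
    F: Fn(&str) -> Vec<usize>,
{
    let mut out = String::with_capacity(text.len());
    let mut rest = text;
    while !rest.is_empty() {
        // Find the leading run of characters that are all alphabetic
        // or all non-alphabetic; the run is never empty.
        let first_is_alpha = rest.chars().next().map_or(false, char::is_alphabetic);
        let run_len = rest
            .find(|c: char| c.is_alphabetic() != first_is_alpha)
            .unwrap_or(rest.len());
        let (run, tail) = rest.split_at(run_len);
        if first_is_alpha {
            // Copy the word, adding a soft hyphen at each break.
            let mut start = 0;
            for end in breaks_for(run) {
                out.push_str(&run[start..end]);
                out.push('\u{ad}');
                start = end;
            }
            out.push_str(&run[start..]);
        } else {
            // Spaces and punctuation pass through unchanged.
            out.push_str(run);
        }
        rest = tail;
    }
    out
}
```

In real use, the closure would wrap a loaded dictionary, for instance by returning a clone of the breaks field shown earlier.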
Hyphenation patterns for languages affected by normalization occasionally cover multiple forms, at the discretion of their authors, but most often they don’t. If you require kl-hyphenate to operate strictly on strings in a known normalization form, as described by the Unicode Standard Annex #15 and provided by the unicode-normalization crate, you may specify it in your Cargo manifest, like so:
[dependencies.kl-hyphenate]
version = "…"
features = ["nfc"]
The features field may contain exactly one of the following normalization options:
- "nfc", for canonical composition;
- "nfd", for canonical decomposition;
- "nfkc", for compatibility composition;
- "nfkd", for compatibility decomposition.

It is recommended to build kl-hyphenate in release mode if normalization is enabled, since the bundled hyphenation patterns will need to be reprocessed into dictionaries.
Dual-licensed under the terms of either:
hyph-utf8 hyphenation patterns © their respective owners; see their master files for licensing information.
patterns/hyph-hu.ext.txt (extended Hungarian hyphenation patterns) is licensed under the terms given in patterns/hyph-hu.ext.lic.txt.
patterns/hyph-ca.ext.txt (extended Catalan hyphenation patterns) is licensed under the terms given in patterns/hyph-ca.ext.lic.txt.